© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Hive Present and Future
Yifeng Jiang
Solutions Engineer, Hortonworks, inc.
July 23, 2015
About Me
蒋 燚峰 (Yifeng Jiang)
• Solutions Engineer, Hortonworks inc.
• HBase book author
• Hobbies: hiking, watching movies
Agenda
• Apache Hive Present
• How Hive Achieved 100x Performance
• Sub-second Response
Hadoop for the Enterprise:
Implement a Modern Data Architecture with HDP
Customer Momentum
• 430+ customers (as of March 31, 2015)
• 105 customers added in Q1 2015
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.
Partner for Customer Success
• Open source community leadership focused on enterprise needs
• Unrivaled world-class support
• Founded in 2011
• Original 24 architects, developers, and operators of Hadoop from Yahoo!
• 600+ Employees
• 1100+ Ecosystem Partners
Apache Project   Committers   PMC Members
Hadoop           27           21
Pig              5            5
Hive             18           6
Tez              16           15
HBase            6            4
Phoenix          4            4
Accumulo         2            2
Storm            3            2
Slider           11           11
Falcon           5            3
Flume            1            1
Sqoop            1            1
Ambari           36           28
Oozie            3            2
Zookeeper        2            1
Knox             13           3
Ranger           11           n/a
TOTAL            164          109
Hortonworks Data Platform (HDP) 2.2 Stack
Hive: SQL on Hadoop
Apache Hive Present
Transaction, Security, Performance
Apache Hive: SQL on Hadoop
• OSS data warehouse built on top of Hadoop
• First Apache Hive released in 2009
• Initial goal was to write MapReduce jobs in SQL
– Most queries ran from minutes to hours
– Primarily used for batch processing
Hive – Single tool for all SQL use cases
Data sources: OLTP, ERP, CRM systems; unstructured documents, emails; clickstream; server logs; sentiment, web data; sensor, machine data; geolocation
Hive - SQL workloads: interactive analytics; batch reports / deep analytics; ETL / ELT
Hive Scales to Any Workload
Hive at Facebook
• 100+ PB of data under management
• 15+ TB of data loaded daily
• 60,000+ Hive queries per day
• More than 1,000 users per day
Transactions
Insert, Update and Delete SQL Statements
Transaction Use Cases
Reporting with Analytics (YES)
Reporting on data with occasional updates
Corrections to the fact tables, evolving dimension tables
Low concurrency updates, low TPS
Operational (OLTP) Database (NO)
Small Transactions, each doing single line inserts
High Concurrency - Hundreds to thousands of connections
[Diagram: an OLTP system replicates into Hive, which serves analytics and modifications (YES); Hive used directly as a high-concurrency OLTP store (NO)]
Deep Dive: Transaction
Transaction Support in Hive with ACID semantics
• Hive native support for INSERT, UPDATE, DELETE.
• Split into phases:
  • Phase 1: Hive Streaming Ingest (append) [Done]
  • Phase 2: INSERT / UPDATE / DELETE Support [Done]
  • Phase 3: BEGIN / COMMIT / ROLLBACK Txn [Next]
1. Original File: Task reads the latest read-optimized ORCFile.
2. Edits Made: Task reads the ORCFile and merges the delta file with the edits.
3. Edits Merged: Task reads the updated, merged read-optimized ORCFile.
Hive ACID Compactor periodically merges the delta files in the background.
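The INSERT / UPDATE / DELETE support described above can be sketched in HiveQL as follows. This is a minimal sketch, assuming Hive 0.14 or later with the DbTxnManager transaction manager available; the table `customer_dim` and its columns are hypothetical and do not come from the slides. In these Hive versions a transactional table has to be bucketed and stored as ORC.

```sql
-- Session settings commonly needed for ACID tables in Hive 0.14 / 1.x
-- (assumption: the cluster has not already set them in hive-site.xml).
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical dimension table: bucketed, ORC, flagged as transactional.
CREATE TABLE customer_dim (
  id   INT,
  name STRING,
  city STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");

-- Each statement below is written as a delta file next to the base ORC data.
INSERT INTO TABLE customer_dim VALUES (1, 'Alice', 'Shanghai'), (2, 'Bob', 'Tokyo');
UPDATE customer_dim SET city = 'Beijing' WHERE id = 1;
DELETE FROM customer_dim WHERE id = 2;
```

This matches the "Reporting with Analytics" use case: occasional corrections at low concurrency, not many small single-row transactions.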
Hive Compaction
[Diagram: delta files accumulate alongside the read-optimized ORCFile and are merged into a new read-optimized ORCFile]
Minor Compaction: 10% local
Major Compaction: 10% global
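Compaction normally runs in the background, but it can also be requested by hand. The following is a minimal sketch, assuming a transactional table named `customer_dim` (a hypothetical name) and a metastore with the compactor enabled (`hive.compactor.initiator.on=true` and at least one worker thread in `hive.compactor.worker.threads`).

```sql
-- Queue a minor compaction: merge the accumulated delta files with each other.
ALTER TABLE customer_dim COMPACT 'minor';

-- Queue a major compaction: rewrite base and delta files into a new base file.
ALTER TABLE customer_dim COMPACT 'major';

-- Inspect queued, running, and finished compaction requests.
SHOW COMPACTIONS;
```

The ALTER TABLE statements only enqueue the request; the compactor worker does the merge asynchronously.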
Security
Hive User’s perspective
Ranger: Central Security Administration
Apache Ranger
• Security dashboard
• Centralizes administration of security policy
• Ensures consistent coverage across the entire Hadoop stack
Setup Authorization Policy (Hive)
[Screenshot callouts: file level access control, flexible definition; control permissions]
How Hive Achieved 100x Performance
ORC, Tez, CBO, Vectorization
Need for Speed: The Stinger Initiative
Stinger: An Open Roadmap to improve Apache Hive’s performance 100x.
Launched: February 2013; Delivered: April 2014.
Delivered in 100% Apache Open Source.
Vectorized SQL Engine (SQL Engine) + ORCFile (Columnar Storage) + Apache Tez (Distributed Execution) = 100X
TPC-DS Benchmark at 30 Terabyte Scale
Sample of 50 queries from TPC-DS at 30 terabyte scale.
Average 52x Query Speedup, Maximum 160x Query Speedup.
Total benchmark time decreased from 7.8 days to 9.3 hours.
The Cost-Based Optimizer added in Hive 0.14 gave an additional 2.5x speedup.
ORC File Format
Columnar Storage for Hive
ORCFile – Columnar Storage for Hive
• Columns stored separately
• Knows types
–Uses type-specific encoders
–Stores statistics (min, max, sum, count)
• Has light-weight index
–Skip over blocks of rows that don’t matter
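The light-weight index only helps when the reader is allowed to push filters down to it. Below is a minimal sketch of the relevant settings; the table `visits_orc` and its columns are hypothetical, and the property values shown are the usual defaults, written out only for illustration.

```sql
-- Let Hive push WHERE predicates down to the ORC reader so it can use
-- the per-row-group min/max statistics to skip data.
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;

-- Hypothetical ORC table with the row index made explicit.
CREATE TABLE visits_orc (
  visit_id INT,
  store_id SMALLINT,
  amount   DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
  "orc.create.index"="true",        -- build the light-weight index
  "orc.row.index.stride"="10000"    -- rows covered by each index entry
);

-- Row groups whose visit_id range falls outside 1000..2000 can be skipped.
SELECT COUNT(*) FROM visits_orc WHERE visit_id BETWEEN 1000 AND 2000;
```

Skipping works best when the data is roughly sorted on the filter column, so that each row group covers a narrow value range.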
ORCFile – Columnar Storage for Hive
Large block size ideal for map/reduce.
Columnar format enables high compression and high performance.
ORCFile – Create Table
• Defined at table or partition level
• Configurable compression codec
-- ORC table with ZLIB compression set through a table property
CREATE TABLE Addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
) STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");
ORCFile – Convert Text to ORC
• Always use ORC
• One SQL statement converts text to ORC
-- Create Text & ORC tables (the text table is comma-delimited to match the CSV file)
CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
CREATE TABLE test_details_orc( visit_id INT, store_id SMALLINT) STORED AS ORC;
-- Load into Text table
LOAD DATA LOCAL INPATH '/home/user/test_details.csv' INTO TABLE test_details_txt;
-- Copy to ORC table
INSERT OVERWRITE TABLE test_details_orc SELECT * FROM test_details_txt;
Tez Engine
Beyond MapReduce
MR vs. Tez Example
Hive - MR: separate jobs with I/O synchronization barriers between them
  Job 1 (Join a & b)
  Job 2 (Group by of a Join b)
  Job 3 (Group by of c)
  Job 4 (Join of S & R)
Hive - Tez: single job
  Join a & b
  Group by of a Join b
  Group by of c
  Join of S & R
Tez – Introduction
• Distributed execution framework for data-processing applications
–Targets applications (frameworks), not end users
–Hive on Tez, Pig on Tez, Cascading on Tez, …
• Lessons learned from MapReduce
–Significant performance improvement
–Batch, interactive
–Petabyte scale
• Runs on YARN
–Utilizes cluster resources
Tez – Switch from MapReduce
• One command to switch from MapReduce to Tez
set hive.execution.engine=tez;
SELECT * FROM my_table;
• Set Tez as default engine on Hadoop 2
$ vi hive-site.xml
<property><name>hive.execution.engine</name><value>tez</value></property>
Cost Based Optimizer
Making the SQL smarter
Cost Based Optimizer in Hive
Cost-Based Optimizer (CBO) creates an optimized execution plan using Hive table statistics
Why cost-based optimization?
• Ease of use – e.g., join order is adjusted automatically
• Reduces the need for SQL tuning
• Optimized plans lead to better cluster utilization
Performance Improvement – Query 17
Scale = 30TB, input records ~186M
CBO   Elapsed Time (sec)   Elapsed Time   Intermediate data (GB)   Output and Intermediate Records
OFF   10,683               ~3 hrs         5,017                    135,647,792,123
ON    1,284                ~20 mins       275                      8,543,232,360
CBO – Enable CBO
• Enable CBO before submitting query
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
• Refresh statistics
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;
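The CBO can only reorder joins well if both table-level and column-level statistics exist for every table in the query. Here is a minimal sketch, assuming two unpartitioned tables named after the TPC-DS schema (`store_sales` and `item`); the names and the query are illustrative only and are not taken from the benchmark run above.

```sql
SET hive.cbo.enable=true;

-- Table-level statistics (row counts, sizes) for both sides of the join.
ANALYZE TABLE store_sales COMPUTE STATISTICS;
ANALYZE TABLE item COMPUTE STATISTICS;

-- Column-level statistics (min/max, distinct values) used for cardinality estimates.
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE item COMPUTE STATISTICS FOR COLUMNS;

-- Check that the statistics were recorded in the table parameters.
DESCRIBE FORMATTED store_sales;

-- Inspect the plan the optimizer chose for a simple join and aggregate.
EXPLAIN
SELECT i.i_category, SUM(s.ss_sales_price)
FROM store_sales s
JOIN item i ON s.ss_item_sk = i.i_item_sk
GROUP BY i.i_category;
```

Statistics go stale after large loads, so the ANALYZE statements are worth re-running as part of the ETL job.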
Vectorized Query Execution
Process 1024 Rows at a Time
Vectorization – Vectorized SQL Engine
• Feature:
–Process a block of 1024 rows instead of one row at a time
–Leverage modern hardware architecture
• Benefit:
–Up to 3x faster for large queries
–Reduces CPU time, makes better use of cluster resources
Vectorization – Enable Vectorization
• Enable vectorized SQL engine
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
• Supports ORC only
• A few data types and features are not supported
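Because some data types and features fall back to row-at-a-time execution, it helps to confirm that a given query is in fact vectorized. Below is a minimal sketch, assuming an ORC table `test_details_orc` with columns `visit_id` and `store_id`; treat the table as hypothetical if no such table exists.

```sql
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- Look in the plan output for a marker stating that the stage runs vectorized.
-- The exact wording of that marker differs between Hive versions.
EXPLAIN
SELECT store_id, COUNT(*) AS visits
FROM test_details_orc
GROUP BY store_id;
```

If the marker is missing, the usual causes are a non-ORC input table or an unsupported type or function in the query.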
Hive on Tez: Conclusion
Hive on Tez delivers fast batch and interactive SQL today.
But users need more speed!
Scale: Proven at petabyte scale.
SQL: The most comprehensive open-source SQL on Hadoop.
Speed: More than 90 Hortonworks customers use Hive-on-Tez today for fast SQL.
Hortonworks Customer Support metrics as of Feb/2015
Sub-second Query Response
Solving Hive’s Top Performance Challenges
Next Stop: Stinger.next and Sub-Second SQL
The emergence of LLAP and Hive-on-Spark brings sub-second response within reach.
Apache Hive: Modern Architecture
Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom Weblog)
Engine: SQL Engines (Row Engine, Vector Engine)
SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
Cache: Block Cache (Linux Cache)
Distributed Execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez)
Legend: Historical, Current, In Development
Apache Hive: Modern Architecture
Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom Weblog)
Engine: SQL Engines (Row Engine, Vector Engine)
SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
Cache: Block Cache (Linux Cache)
Distributed Execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez)
Added in this view: Vector Cache, LLAP, Persistent Server
Legend: Historical, Current, In Development
HBase Meta store: Why?
700+ metastore queries to create an execution plan!
LLAP: What?
LLAP = Live Long And Process
LLAP process runs on multiple nodes, accelerating Tez tasks
[Diagram: a Hive query is split into query fragments; on each node, an LLAP process runs the read task for a query against HDFS, using the LLAP in-memory columnar cache]
LLAP: Why?
• LLAP is a node-resident daemon process
– Low latency by reducing setup cost
• LLAP has an in-memory columnar data cache
– Hot data sits in memory, not HDFS
– Stores data in columnar format for vectorized processing
• Uses YARN for resource management
– Utilizes cluster resources
Hive Sub-second Response
Fast, Scalable Metadata Catalog (Hive Metadata) + LLAP (Persistent Server) + Vectorized Hash Join (SQL Engine) + Tez (Choice of Execution Engines) = Sub-Second
Key Takeaways
Hive Present and Future
Hive Present and Future
• Hive is the de facto standard of SQL on Hadoop
• One tool, batch and interactive processing
• One tool, all big data SQL use cases: ETL, reporting, BI and analytics
• Hive keeps evolving
• SQL:2011 Analytics support
• Enhance transactions
• Sub-second query response
Try Hive Today
• Try the latest Hive features today
• Hive on Tez
• ORC file format
• CBO
• Vectorization
• Just a few lines of configuration/SQL change
• Stay tuned for Hive evolution
Thank you
Yifeng Jiang, Solutions Engineer, Hortonworks
@uprush

Contenu connexe

Tendances

Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
DataWorks Summit
 

Tendances (20)

Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
 
Double Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseDouble Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSense
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical Applications
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Hortonworks Data In Motion Series Part 3 - HDF Ambari
Hortonworks Data In Motion Series Part 3 - HDF Ambari Hortonworks Data In Motion Series Part 3 - HDF Ambari
Hortonworks Data In Motion Series Part 3 - HDF Ambari
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
MiNiFi 0.0.1 MeetUp talk
MiNiFi 0.0.1 MeetUp talkMiNiFi 0.0.1 MeetUp talk
MiNiFi 0.0.1 MeetUp talk
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
 
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and TroubleshootingApache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016
 
Apache NiFi Crash Course Intro
Apache NiFi Crash Course IntroApache NiFi Crash Course Intro
Apache NiFi Crash Course Intro
 
Introduction to Hortonworks Data Cloud for AWS
Introduction to Hortonworks Data Cloud for AWSIntroduction to Hortonworks Data Cloud for AWS
Introduction to Hortonworks Data Cloud for AWS
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
 
Apache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in NutshellApache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in Nutshell
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the Cloud
 

Similaire à Hive present-and-feature-shanghai

Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 

Similaire à Hive present-and-feature-shanghai (20)

An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Data Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat AlwellData Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat Alwell
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging Manager
 
What's new in Ambari
What's new in AmbariWhat's new in Ambari
What's new in Ambari
 
An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, Future
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - Webinar
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
 
Predicting Customer Experience through Hadoop and Customer Behavior Graphs
Predicting Customer Experience through Hadoop and Customer Behavior GraphsPredicting Customer Experience through Hadoop and Customer Behavior Graphs
Predicting Customer Experience through Hadoop and Customer Behavior Graphs
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
 
Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
 

Plus de Yifeng Jiang

Plus de Yifeng Jiang (20)

Hive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsHive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfs
 
introduction-to-apache-kafka
introduction-to-apache-kafkaintroduction-to-apache-kafka
introduction-to-apache-kafka
 
Hive2 Introduction -- Interactive SQL for Big Data
Hive2 Introduction -- Interactive SQL for Big DataHive2 Introduction -- Interactive SQL for Big Data
Hive2 Introduction -- Interactive SQL for Big Data
 
Introduction to Streaming Analytics Manager
Introduction to Streaming Analytics ManagerIntroduction to Streaming Analytics Manager
Introduction to Streaming Analytics Manager
 
HDF 3.0 IoT Platform for Everyone
HDF 3.0 IoT Platform for EveryoneHDF 3.0 IoT Platform for Everyone
HDF 3.0 IoT Platform for Everyone
 
Hortonworks Data Cloud for AWS 1.11 Updates
Hortonworks Data Cloud for AWS 1.11 UpdatesHortonworks Data Cloud for AWS 1.11 Updates
Hortonworks Data Cloud for AWS 1.11 Updates
 
Spark Security
Spark SecuritySpark Security
Spark Security
 
Real-time Analytics in Financial
Real-time Analytics in FinancialReal-time Analytics in Financial
Real-time Analytics in Financial
 
sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16sparksql-hive-bench-by-nec-hwx-at-hcj16
sparksql-hive-bench-by-nec-hwx-at-hcj16
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
 
Yifeng hadoop-present-public
Yifeng hadoop-present-publicYifeng hadoop-present-public
Yifeng hadoop-present-public
 
Hive-sub-second-sql-on-hadoop-public
Hive-sub-second-sql-on-hadoop-publicHive-sub-second-sql-on-hadoop-public
Hive-sub-second-sql-on-hadoop-public
 
Yifeng spark-final-public
Yifeng spark-final-publicYifeng spark-final-public
Yifeng spark-final-public
 
Kinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-diveKinesis vs-kafka-and-kafka-deep-dive
Kinesis vs-kafka-and-kafka-deep-dive
 
Apache Hiveの今とこれから
Apache Hiveの今とこれからApache Hiveの今とこれから
Apache Hiveの今とこれから
 
HDFS Deep Dive
HDFS Deep DiveHDFS Deep Dive
HDFS Deep Dive
 
Hadoop Trends & Hadoop on EC2
Hadoop Trends & Hadoop on EC2Hadoop Trends & Hadoop on EC2
Hadoop Trends & Hadoop on EC2
 
Apache Ambari Overview -- Hadoop for Everyone
Apache Ambari Overview -- Hadoop for EveryoneApache Ambari Overview -- Hadoop for Everyone
Apache Ambari Overview -- Hadoop for Everyone
 
HDP Security Overview
HDP Security OverviewHDP Security Overview
HDP Security Overview
 
Data Science on Hadoop
Data Science on HadoopData Science on Hadoop
Data Science on Hadoop
 

Dernier

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 

Dernier (20)

VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 

Hive present-and-feature-shanghai

  • 1. © Hortonworks Inc. 2011 – 2015. All Rights Reserved Apache Hive Present and Future Yifeng Jiang Solutions Engineer, Hortonworks, inc. July 23, 2015
  • 2. © Hortonworks Inc. 2011 – 2015. All Rights Reserved About Me 蒋 燚峰 (Yifeng Jiang) • Solutions Engineer, Hortonworks inc. • HBase book author • Hobbies: hiking, watching movie
  • 3. © Hortonworks Inc. 2011 – 2015. All Rights Reserved Agenda • Apache Hive Present • How Hive Achieved 100x Performance • Sub-second Response
  • 4. © Hortonworks Inc. 2011 – 2015. All Rights Reserved Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP Customer Momentum • 430+ customers (as of March 31, 2015) • 105 customers added in Q1 2015 Hortonworks Data Platform • Completely open multi-tenant platform for any app & any data. • A centralized architecture of consistent enterprise services for resource management, security, operations, and governance. Partner for Customer Success • Open source community leadership focus on enterprise needs • Unrivaled world class support • Founded in 2011 • Original 24 architects, developers, operators of Hadoop from Yahoo! • 600+ Employees • 1100+ Ecosystem Partners Apache Project Committers PMC Members Hadoop 27 21 Pig 5 5 Hive 18 6 Tez 16 15 HBase 6 4 Phoenix 4 4 Accumulo 2 2 Storm 3 2 Slider 11 11 Falcon 5 3 Flume 1 1 Sqoop 1 1 Ambari 36 28 Oozie 3 2 Zookeeper 2 1 Knox 13 3 Ranger 11 n/a TOTAL 164 109
  • 5. © Hortonworks Inc. 2011 – 2015. All Rights Reserved Hortonworks Data Platform (HDP) 2.2 Stack Hive: SQL on Hadoop
  • 6. © Hortonworks Inc. 2011 – 2015. All Rights Reserved Apache Hive Present Transaction, Security, Performance
  • 7. © Hortonworks Inc. 2011 – 2015. All Rights Reserved Apache Hive: SQL on Hadoop • OSS data warehouse built on top of Hadoop • First Apache Hive released in 2009 • Initial goal was to write MapReduce jobs in SQL – Most query ran from minutes to hours – Primary used for batch processing
  • 8. © Hortonworks Inc. 2011 – 2015. All Rights Reserved Hive – Single tool for all SQL use cases OLTP, ERP, CRM Systems Unstructured documents, emails Clickstream Server logs Sentiment, Web Data Sensor. Machine Data Geolocation Interactive Analytics Batch Reports / Deep Analytics Hive - SQL ETL / ELT
  • 9. © Hortonworks Inc. 2011 – 2015. All Rights Reserved © Hortonworks Inc. 2015 Hive Scales to Any Workload Page 9 Hive at Facebook • 100+ PB of data under management • 15+ TB of data loaded daily • 60,000+ Hive queries per day • More than 1,000 users per day
  • 10. Transactions: Insert, Update and Delete SQL Statements
  • 11. Transaction Use Cases. Reporting with Analytics (YES) • Reporting on data with occasional updates • Corrections to the fact tables, evolving dimension tables • Low-concurrency updates, low TPS. Operational (OLTP) Database (NO) • Small transactions, each doing single-line inserts • High concurrency: hundreds to thousands of connections. [Diagram labels: OLTP, Replication, Hive, Analytics, Modifications; High Concurrency OLTP, Hive]
  • 12. Deep Dive: Transaction. Transaction support in Hive with ACID semantics • Hive native support for INSERT, UPDATE, DELETE • Split into phases: Phase 1: Hive Streaming Ingest (append) [Done]; Phase 2: INSERT / UPDATE / DELETE support [Done]; Phase 3: BEGIN / COMMIT / ROLLBACK Txn [Next]. [Diagram] 1. Original file: the task reads the latest read-optimized ORCFile. 2. Edits made: the task reads the ORCFile and merges the delta file with the edits. 3. Edits merged: the task reads the updated ORCFile. The Hive ACID compactor periodically merges the delta files in the background.
  • 13. Hive Compaction: Minor / Major compaction. [Diagram labels: Read-Optimized ORCFile, Delta File, Merged Read-Optimized ORCFile; Minor Compaction 10% local; Major Compaction 10% global]
  • 14. Security: Hive User's Perspective
  • 15. Ranger: Central Security Administration. Apache Ranger • Security dashboard • Centralizes administration of security policy • Ensures consistent coverage across the entire Hadoop stack
  • 16. Set Up Authorization Policy (Hive): file-level access control, flexible definition; control permissions
  • 17. How Hive Achieved 100x Performance: ORC, Tez, CBO, Vectorization
  • 18. Need for Speed: The Stinger Initiative. Stinger: an open roadmap to improve Apache Hive's performance 100x. Launched: February 2013; delivered: April 2014. Delivered in 100% Apache open source. SQL Engine (Vectorized SQL Engine) + Columnar Storage (ORCFile) + Distributed Execution (Apache Tez) = 100X
  • 19. TPC-DS Benchmark at 30 Terabyte Scale. Sample of 50 queries from TPC-DS at 30 terabyte scale. Average 52x query speedup, maximum 160x query speedup. Total benchmark time decreased from 7.8 days to 9.3 hours. The Cost-Based Optimizer added in Hive 0.14 gave an additional 2.5x speedup.
  • 20. ORC File Format: Columnar Storage for Hive
  • 21. ORCFile: Columnar Storage for Hive • Columns stored separately • Knows types – Uses type-specific encoders – Stores statistics (min, max, sum, count) • Has a lightweight index – Skips over blocks of rows that don't matter
  • 22. ORCFile: Columnar Storage for Hive. Large block size is ideal for MapReduce. Columnar format enables high compression and high performance.
  • 23. ORCFile: Create Table • Defined at table or partition level • Configurable compression codec. create table Addresses ( name string, street string, city string, state string, zip int ) stored as orc tblproperties ("orc.compress"="ZLIB");
  • 24. ORCFile: Convert Text to ORC • Always use ORC • One SQL statement converts text to ORC. /* Create Text & ORC tables */ CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE; CREATE TABLE test_details_orc( visit_id INT, store_id SMALLINT) STORED AS ORC; /* Load into Text table */ LOAD DATA LOCAL INPATH '/home/user/test_details.csv' INTO TABLE test_details_txt; /* Copy to ORC table */ INSERT OVERWRITE TABLE test_details_orc SELECT * FROM test_details_txt;
  • 25. Tez Engine: Beyond MapReduce
  • 26. MR vs. Tez Example. Hive on MR: separate jobs with I/O synchronization barriers between them: Job 1 (join a & b), Job 2 (group by of a join b), Job 3 (group by of c), Job 4 (join of S & R). Hive on Tez: a single job covering the same steps: join a & b, group by of a join b, group by of c, join of S & R.
  • 27. Tez: Introduction • Distributed execution framework for data-processing applications – Targets applications (frameworks), not end users – Hive on Tez, Pig on Tez, Cascading on Tez, … • Lessons learned from MapReduce – Significant performance improvement – Batch and interactive – Petabyte scale • Runs on YARN – Utilizes cluster resources
  • 28. Tez: Switch from MapReduce • One command to switch from MapReduce to Tez: set hive.execution.engine=tez; SELECT * FROM my_table; • Set Tez as the default engine on Hadoop 2: $ vi hive-site.xml, then set the property hive.execution.engine=tez
  • 29. Cost-Based Optimizer: Making the SQL Smarter
  • 30. Cost-Based Optimizer in Hive. The Cost-Based Optimizer (CBO) creates an optimized execution plan using Hive table statistics. Why cost-based optimization? • Simple to use, e.g., adjusts join order automatically • Reduces the need for SQL tuning • An optimized plan leads to better cluster utilization
  • 31. Performance Improvement: Query 17. Scale = 30 TB, input records ~186M. CBO OFF: elapsed time 10,683 sec (~3 hrs), intermediate data 5,017 GB, output and intermediate records 135,647,792,123. CBO ON: elapsed time 1,284 sec (~20 mins), intermediate data 275 GB, output and intermediate records 8,543,232,360.
  • 32. CBO: Enable CBO • Enable CBO before submitting a query: set hive.cbo.enable=true; set hive.compute.query.using.stats=true; set hive.stats.fetch.column.stats=true; set hive.stats.fetch.partition.stats=true; • Refresh statistics: ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;
  • 33. Vectorized Query Execution: Process 1024 Rows at a Time
  • 34. Vectorization: Vectorized SQL Engine • Feature: – Processes a block of 1024 rows instead of one row at a time – Leverages modern hardware architecture • Benefit: – Up to 3x faster for large queries – Reduces CPU time, utilizes cluster resources
  • 35. Vectorization: Enable Vectorization • Enable the vectorized SQL engine: set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true; • Supports ORC only • A few data types and features are not supported
  • 36. Hive on Tez: Conclusion. Hive on Tez delivers fast batch and interactive SQL today. But users need more speed! Scale: proven at petabyte scale. SQL: the most comprehensive open-source SQL on Hadoop. Speed: more than 90 Hortonworks customers use Hive-on-Tez today for fast SQL (Hortonworks Customer Support metrics as of Feb/2015).
  • 37. Sub-second Query Response: Solving Hive's Top Performance Challenges
  • 38. Next Stop: Stinger.next and Sub-Second SQL. The emergence of LLAP and Hive-on-Spark brings sub-second response within reach.
  • 39. Apache Hive: Modern Architecture. Storage: columnar storage (ORCFile, Parquet); unstructured data (JSON, CSV, Text, Avro, Custom, Weblog). Engine: SQL engines (row engine, vector engine). SQL: SQL support (SQL:2011), optimizer, HCatalog, HiveServer2. Cache: block cache (Linux cache). Distributed execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez). Legend: historical, current, in development.
  • 40. Apache Hive: Modern Architecture. Storage: columnar storage (ORCFile, Parquet); unstructured data (JSON, CSV, Text, Avro, Custom, Weblog). Engine: SQL engines (row engine, vector engine). SQL: SQL support (SQL:2011), optimizer, HCatalog, HiveServer2. Cache: block cache (Linux cache). Distributed execution: Hadoop 1 (MapReduce), Hadoop 2 (Tez). Additional boxes on this slide: Vector Cache, LLAP, Persistent Server. Legend: historical, current, in development.
  • 41. HBase Metastore: Why? 700+ metastore queries to create an execution plan! (Slide footer: Hive & HBase For Transaction Processing)
  • 42. LLAP: What. LLAP = Live Long And Process. LLAP processes run on multiple nodes, accelerating Tez tasks. [Diagram: a Hive query is spread across nodes that each run an LLAP process; on a node, the LLAP process runs a query fragment (a read task for the query) using the LLAP in-memory columnar cache and HDFS]
  • 43. LLAP: Why? • LLAP is a node-resident daemon process – Low latency by reducing setup cost • LLAP has an in-memory columnar data cache – Hot data sits in memory, not HDFS – Stores data in columnar format for vectorized processing • Uses YARN for resource management – Utilizes cluster resources. [Diagram: a node whose LLAP process runs a query fragment (a task for a query) using the LLAP in-memory columnar cache and HDFS]
  • 44. Hive Sub-second Response. Sub-Second = Hive Metadata (fast, scalable metadata catalog) + Persistent Server (LLAP) + SQL Engine (vectorized hash join) + Choice of Execution Engines (Tez)
  • 45. Key Takeaways: Hive Present and Future
  • 46. Hive Present and Future • Hive is the de facto standard for SQL on Hadoop • One tool for batch and interactive processing • One tool for all big data SQL use cases: ETL, reporting, BI and analytics • Hive keeps evolving • SQL:2011 analytics support • Enhanced transactions • Sub-second query response
  • 47. Try Hive Today • Try the latest Hive features today • Hive on Tez • ORC file format • CBO • Vectorization • Just a few lines of configuration/SQL changes • Stay tuned for Hive's evolution
  • 48. Thank you. Yifeng Jiang, Solutions Engineer, Hortonworks @uprush
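
Slides 10 to 13 above describe Hive's ACID support (INSERT, UPDATE, DELETE and background compaction) without showing any SQL. The following is a minimal sketch of how that might look on a Hive 0.14 / 1.x cluster. The table `customer_dim` and its columns are hypothetical, and the session settings are assumptions that do not come from the slides; check them against the documentation for your Hive version before relying on them.

```sql
-- Session settings commonly required for ACID tables in Hive 0.14 / 1.x.
-- Assumption: these are not shown in the slides, and later releases
-- changed or removed some of them.
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lock.DbTxnManager;

-- Hypothetical dimension table. In these releases a transactional table
-- is stored as ORC, bucketed, and flagged with 'transactional'='true'.
CREATE TABLE customer_dim (
  customer_id INT,
  name        STRING,
  city        STRING
)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Phase 2 statements from slide 12. Each statement writes a delta file
-- that readers merge with the base ORC files at query time.
INSERT INTO TABLE customer_dim VALUES (1, 'Alice', 'Tokyo'), (2, 'Bob', 'Osaka');
UPDATE customer_dim SET city = 'Kyoto' WHERE customer_id = 2;
DELETE FROM customer_dim WHERE customer_id = 1;

-- Request a compaction by hand instead of waiting for the background
-- compactor, then list queued and running compactions.
ALTER TABLE customer_dim COMPACT 'major';
SHOW COMPACTIONS;
```

The background compactor itself is enabled on the metastore side (for example through `hive.compactor.initiator.on` and `hive.compactor.worker.threads`), which is again an assumption beyond what the slides state. This pattern suits the occasional, low-concurrency corrections listed on slide 11, not OLTP-style single-row traffic.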

Editor's notes

  1. Hay fever...
  2. Raise your hand if you have heard of it.
  3. Hortonworks has a singular focus: enabling Apache Hadoop as an enterprise data platform for any app and any data type. We were founded in 2011 by 24 developers from Yahoo, where Hadoop was conceived to address data challenges at internet scale. What we now know as Hadoop really started in 2005, when a team at Yahoo was directed to build out a large-scale data storage and processing technology that would allow them to improve their most critical application, Search. Their challenge was essentially twofold. First they needed to capture and archive the contents of the internet, and then process the data so that users could search through it effectively and efficiently. Clearly traditional approaches were both technically (due to the size of the data) and commercially (due to the cost) impractical. The result was the Apache Hadoop project, which delivered large-scale storage (HDFS) and processing (MapReduce). Today we have over 600 employees and have partnered with over 1000 companies who are the leaders in the data center. We have also been very fortunate to achieve very significant customer adoption, with over 330 customers as of the end of 2014, spanning nearly every vertical. Hortonworks was founded with the sole intent to make Hadoop an enterprise data platform. With YARN as its foundation, HDP delivers a centralized architecture with true multi-tenancy for data processing and shared services for security, governance and operations to satisfy enterprise requirements, all deeply integrated and certified with leading datacenter technologies. We are uniquely focused on this transformation of Hadoop and do our work completely in open source. This is all predicated on our leadership in the community, which enables us not only to best support users of Hadoop but also to uniquely represent customer requirements within this open, thriving community.
  4. With my level of Japanese...
  5. Vectorized query execution improves the performance of operations such as scans, aggregations, filters and joins by processing them in batches of 1024 rows instead of a single row at a time.
  6. LLAP: persistent servers cache vectors and start queries instantly, with pluggable integration with Tez. Vectorized hash join solves CPU-boundedness for Hive on Tez. An improved metadata catalog allows instant query planning and optimization for any engine.
  7. LLAP is a node-resident daemon process: low latency by reducing setup cost; a multi-threaded engine that runs smaller tasks for a query, including reads, filters and some joins; regular Tez tasks are used for larger shuffles and other operators. LLAP has an in-memory columnar data cache: high-throughput I/O using an async I/O elevator with a dedicated thread and core per disk; low latency by serving data from the in-memory (off-heap) cache instead of going to HDFS; data is stored in columnar format for vectorization irrespective of the underlying file type; security is enforced across queries and users. LLAP uses YARN for resource management.
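
Note 5 above says vectorized execution speeds up scans, filters, aggregations and joins by working on batches of 1024 rows. One way to check whether a given query actually takes that path is to inspect its plan. The sketch below assumes the `test_details_orc` table from slide 24 and the settings from slides 28 and 35; the query itself, and the expectation about what the plan text contains, are illustrative assumptions.

```sql
-- Settings taken from slides 28 and 35. Vectorization applies to ORC tables.
SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- Illustrative scan + filter + aggregation over the ORC table from slide 24.
-- In the EXPLAIN output, look for a line such as "Execution mode: vectorized"
-- under the map-side vertex (the exact wording may differ between versions).
EXPLAIN
SELECT store_id, COUNT(*) AS visits
FROM test_details_orc
WHERE visit_id > 1000
GROUP BY store_id;
```

If the plan does not mention vectorized execution, an unsupported data type or expression in the query is a likely cause, as slide 35 warns.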