2. An Introduction to Hive:
Components and Query Language
Jeff Hammerbacher
Chief Scientist and VP of Product
October 30, 2008
3. Hive Components
A Leaky Database
▪ Hadoop
▪ HDFS
▪ MapReduce (bundles Resource Manager and Job Scheduler)
▪ Hive
▪ Logical data partitioning
▪ Metastore (command line and web interfaces)
▪ Query Language
▪ Libraries to handle different serialization formats (SerDes)
▪ JDBC interface
4. Related Work
Glaringly Incomplete
▪ Gamma, Bubba, Volcano, etc.
▪ Google: Sawzall
▪ Yahoo: Pig
▪ IBM Research: JAQL
▪ Microsoft: SCOPE
▪ Greenplum: YAML MapReduce
▪ Aster Data: In-Database MapReduce
▪ Business.com: CloudBase
5. Hive Resources
▪ Facebook Mirror: http://mirror.facebook.com/facebook/hive
▪ Currently the best place to get the Hive distribution
▪ Wiki page: http://wiki.apache.org/hadoop/Hive
▪ Getting started: http://wiki.apache.org/hadoop/Hive/GettingStarted
▪ Query language reference: http://wiki.apache.org/hadoop/Hive/HiveQL
▪ Presentations: http://wiki.apache.org/hadoop/Hive/Presentations
▪ Roadmap: http://wiki.apache.org/hadoop/Hive/Roadmap
▪ Mailing list: hive-users@publists.facebook.com
▪ JIRA: https://issues.apache.org/jira/browse/HADOOP/component/12312455
6. Running Hive
Quickstart
▪ <install Hadoop>
▪ wget http://mirror.facebook.com/facebook/hive/hadoop-0.19/dist.tar.gz
▪ (Replace 0.19 with 0.17 if you’re still on 0.17)
▪ tar xvzf dist.tar.gz
▪ cd dist
▪ export HADOOP=<path to bin/hadoop in your Hadoop distribution>
▪ Or: edit hadoop.bin.path and hadoop.config.dir in conf/hive-default.xml
▪ bin/hive
▪ hive>
7. Running Hive
Configuration Details
▪ conf/hive-default.xml
▪ hadoop.bin.path: Points to bin/hadoop in your Hadoop installation
▪ hadoop.config.dir: Points to conf/ in your Hadoop installation
▪ hive.exec.scratchdir: HDFS directory where execution information is written
▪ hive.metastore.warehouse.dir: HDFS directory managed by Hive
▪ The rest of the properties relate to the Metastore
▪ conf/hive-log4j.properties
▪ Will put data into /tmp/{user.name}/hive.log by default
▪ conf/jpox.properties
▪ JPOX is a Java object persistence library used by the Metastore
8. Populating Hive
MovieLens Data
▪ <cd into your hive directory>
▪ wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
▪ tar xvzf ml-data.tar__0.gz
▪ CREATE TABLE u_data (userid INT, movieid INT, rating INT, unixtime TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
▪ The first query can take ten seconds or more, as the Metastore needs to be created
▪ To confirm our table has been created:
▪ SHOW TABLES;
▪ DESCRIBE u_data;
▪ LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;
▪ SELECT COUNT(1) FROM u_data;
▪ Should fire off 2 MapReduce jobs and ultimately return a count of 100,000
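Before loading, it can help to confirm locally that the file really has four tab-separated fields per row. The following is a minimal Python sketch of that check; the helper names parse_rating_line and count_rows, and the sample row in the usage note, are illustrative assumptions rather than anything shipped with Hive or MovieLens.

```python
def parse_rating_line(line):
    """Split one tab-delimited row into (userid, movieid, rating, unixtime)."""
    # Hypothetical helper: mirrors the four columns declared for u_data.
    userid, movieid, rating, unixtime = line.rstrip("\n").split("\t")
    return int(userid), int(movieid), int(rating), int(unixtime)


def count_rows(path):
    """Count non-empty rows in a local file, a local analogue of SELECT COUNT(1)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())
```

For example, parse_rating_line("7\t11\t5\t1000\n") yields a four-element tuple of integers, and count_rows('ml-data/u.data') should agree with the count Hive reports.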
9. Hive Query Language
Utility Statements
▪ SHOW TABLES [table_name | table_name_pattern]
▪ DESCRIBE [EXTENDED] table_name
[PARTITION (partition_col = partition_col_value, ...)]
▪ EXPLAIN [EXTENDED] query_statement
▪ SET [EXTENDED]
▪ “SET property_name=property_value” to modify a value
10. Hive Query Language
CREATE TABLE Syntax
▪ CREATE [EXTERNAL] TABLE table_name (col_name data_type [col_comment], ...)
[PARTITIONED BY (col_name data_type [col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
▪ PARTITION columns are virtual columns; they are not part of the data itself but are derived on load
▪ CLUSTERED columns are real columns, hash partitioned into num_buckets folders
▪ ROW FORMAT can be used to specify a delimited data set or a custom deserializer
▪ Use EXTERNAL with ROW FORMAT, STORED AS, and LOCATION to analyze HDFS files in place
▪ “DROP TABLE table_name” can reverse this operation
▪ NB: Currently, DROP TABLE will delete both data and metadata
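The CLUSTERED BY behaviour described above (real column values hashed into num_buckets groups) can be sketched in Python. This is only an illustration of hash partitioning: the function name bucket_for is hypothetical, and zlib.crc32 is a stand-in hash chosen because it is deterministic; the slides do not say which hash function Hive uses.

```python
import zlib


def bucket_for(value, num_buckets):
    """Map a column value to a bucket index in the range [0, num_buckets)."""
    # Assumption: crc32 stands in for whatever hash Hive applies internally.
    digest = zlib.crc32(str(value).encode("utf-8"))
    return digest % num_buckets


def assign_buckets(values, num_buckets):
    """Group values by bucket index, as rows would be grouped into bucket folders."""
    buckets = {index: [] for index in range(num_buckets)}
    for value in values:
        buckets[bucket_for(value, num_buckets)].append(value)
    return buckets
```

Equal values always land in the same bucket, which is the property later operations such as sampling or joins on the clustered column can rely on.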
11. Hive Query Language
CREATE TABLE Syntax, Part Two
▪ data_type: primitive_type | array_type | map_type
▪ primitive_type:
▪ TINYINT | INT | BIGINT | BOOLEAN | FLOAT | DOUBLE | STRING
▪ DATE | DATETIME | TIMESTAMP
▪ array_type: ARRAY < primitive_type >
▪ map_type: MAP < primitive_type, primitive_type >
▪ row_format:
▪ DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
▪ SERIALIZER serde_name [WITH PROPERTIES property_name=property_value,
property_name=property_value, ...]
▪ file_format: SEQUENCEFILE | TEXTFILE
12. Hive Query Language
ALTER TABLE Syntax
▪ ALTER TABLE table_name RENAME TO new_table_name;
▪ ALTER TABLE table_name ADD COLUMNS (col_name data_type [col_comment], ...);
▪ ALTER TABLE table_name DROP partition_spec, partition_spec, ...;
▪ Future work:
▪ Support for removing or renaming columns
▪ Support for altering serialization format
13. Hive Query Language
LOAD DATA Syntax
▪ LOAD DATA [LOCAL] INPATH '/path/to/file'
[OVERWRITE] INTO TABLE table_name
[PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)]
▪ You can load data from the local filesystem or anywhere in HDFS (cf. CREATE TABLE EXTERNAL)
▪ If you don’t specify OVERWRITE, data will be appended to the existing table
14. Hive Query Language
SELECT Syntax
▪ [insert_clause]
SELECT [ALL|DISTINCT] select_list
FROM [table_source|join_source]
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list]
▪ insert_clause: INSERT OVERWRITE destination
▪ destination:
▪ LOCAL DIRECTORY '/local/path'
▪ DIRECTORY '/hdfs/path'
▪ TABLE table_name [PARTITION (partition_col = partition_col_value, ...)]
15. Hive Query Language
SELECT Syntax, Part Two
▪ join_source: table_source join_clause table_source join_clause table_source ...
▪ join_clause
▪ [LEFT OUTER|RIGHT OUTER|FULL OUTER] JOIN ON (equality_expression, equality_expression, ...)
▪ Currently, only outer equi-joins are supported in Hive.
▪ There are two join algorithms
▪ Map-side merge join
▪ Reduce-side merge join
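To make the reduce-side idea concrete, here is a small in-memory Python sketch of a left outer equi-join: rows from both inputs are grouped by join key (the role the shuffle plays in MapReduce), and each key's group is then combined. It is a teaching sketch under those assumptions, not Hive's implementation; the function name and the (key, value) row shape are hypothetical.

```python
from collections import defaultdict


def reduce_side_left_outer_join(left_rows, right_rows):
    """Join two iterables of (key, value) pairs on key, keeping unmatched left rows."""
    # "Map/shuffle" phase: group values by join key, tagged by source table.
    grouped = defaultdict(lambda: ([], []))
    for key, value in left_rows:
        grouped[key][0].append(value)
    for key, value in right_rows:
        grouped[key][1].append(value)

    # "Reduce" phase: for each key, pair every left value with every right value.
    joined = []
    for key in sorted(grouped):
        left_values, right_values = grouped[key]
        for left_value in left_values:
            if right_values:
                for right_value in right_values:
                    joined.append((key, left_value, right_value))
            else:
                # Left outer semantics: keep the left row, pad the right side.
                joined.append((key, left_value, None))
    return joined
```

Keys that appear only on the right produce no output here; swapping the arguments gives the right outer case.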
16. Hive Query Language
Building a Histogram of Review Counts
▪ CREATE TABLE review_counts (userid INT, review_count INT);
▪ INSERT OVERWRITE TABLE review_counts
SELECT a.userid, COUNT(1) AS review_count
FROM u_data a
GROUP BY a.userid;
▪ SELECT b.review_count, COUNT(1)
FROM review_counts b
GROUP BY b.review_count;
▪ Notes:
▪ No INSERT OVERWRITE for second query means output is dumped to the shell
▪ Hive does not currently support CREATE TABLE AS
▪ We have to create the table and then INSERT into it
▪ Hive does not currently support subqueries
▪ We have to write two queries
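The same two-step aggregation can be checked on a small sample in plain Python, which is handy for validating the query results. This is a sketch only; the function name review_count_histogram is hypothetical and the input is assumed to be a list of userid values, one per review.

```python
from collections import Counter


def review_count_histogram(user_ids):
    """Return {review_count: number_of_users} for a list of per-review user ids."""
    # Step 1: reviews per user (the first GROUP BY, on userid).
    reviews_per_user = Counter(user_ids)
    # Step 2: users per review count (the second GROUP BY, on review_count).
    return dict(Counter(reviews_per_user.values()))
```

The intermediate Counter plays the role of the review_counts table; the second Counter corresponds to the final SELECT.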
17. Hive Query Language
Running Custom MapReduce
▪ Put the following into weekday_mapper.py:
▪ import sys
  import datetime
  # Read tab-delimited rows from stdin; emit comma-delimited rows with the ISO weekday
  for line in sys.stdin:
      line = line.strip()
      userid, movieid, rating, unixtime = line.split('\t')
      weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
      print(','.join([userid, movieid, rating, str(weekday)]))
▪ CREATE TABLE u_data_new (userid INT, movieid INT, rating INT, weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
▪ FROM u_data a
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (a.userid, a.movieid, a.rating, a.unixtime)
AS (userid, movieid, rating, weekday)
USING 'python /full/path/to/weekday_mapper.py';
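To exercise the mapper logic without a cluster, the per-line transformation can be factored into a function and tested directly. The sketch below is one way to do that; the name transform_line is hypothetical, and it interprets timestamps in UTC so results are reproducible across machines, whereas a script using datetime.fromtimestamp without a time zone uses the local time zone.

```python
import datetime


def transform_line(line):
    """Turn 'userid<TAB>movieid<TAB>rating<TAB>unixtime' into 'userid,movieid,rating,weekday'."""
    userid, movieid, rating, unixtime = line.strip().split("\t")
    # Assumption: UTC is used here for determinism; local time may be preferable in practice.
    moment = datetime.datetime.fromtimestamp(float(unixtime), tz=datetime.timezone.utc)
    return ",".join([userid, movieid, rating, str(moment.isoweekday())])
```

ISO weekdays run from 1 (Monday) to 7 (Sunday), so the Unix epoch, a Thursday, maps to 4.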
18. Hive Query Language
Programmatic Access
▪ The Hive shell can take a file with queries to be executed
▪ bin/hive -f /path/to/query/file
▪ You can also run a Hive query straight from the command line
▪ bin/hive -e 'quoted query string'
▪ A simple JDBC interface is available for experimentation as well
▪ https://issues.apache.org/jira/browse/HADOOP-4101
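The -e and -f options make it straightforward to drive Hive from a script. Below is a minimal Python sketch that builds and runs such a command; the helper names and the default hive_bin path of "bin/hive" are assumptions for illustration, and only the -e and -f flags mentioned on this slide are used.

```python
import subprocess


def build_hive_command(query=None, query_file=None, hive_bin="bin/hive"):
    """Return the argument list for running one query string or one query file."""
    if (query is None) == (query_file is None):
        raise ValueError("pass exactly one of query or query_file")
    if query is not None:
        return [hive_bin, "-e", query]
    return [hive_bin, "-f", query_file]


def run_hive(query=None, query_file=None, hive_bin="bin/hive"):
    """Run Hive and return its standard output; raises if Hive exits non-zero."""
    command = build_hive_command(query, query_file, hive_bin)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the command as a list avoids shell quoting problems with queries that contain quotes or semicolons.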
19. Hive Components
Metastore
▪ Currently uses an embedded Derby database for persistence
▪ While Derby is in place, you’ll need to put it into Server Mode to
have more than one concurrent Hive user
▪ See http://wiki.apache.org/hadoop/HiveDerbyServerMode
▪ Next release will use MySQL as default persistent data store
▪ The goal is to have the persistent store be pluggable
▪ You can view the Thrift IDL for the metastore online
▪ https://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/hive/metastore/if/hive_metastore.thrift
20. Hive Components
Query Processing
▪ Compiler
▪ Parser
▪ Type Checking
▪ Semantic Analysis
▪ Plan Generation
▪ Task Generation
▪ Execution Engine
▪ Plan
▪ Operators
▪ UDFs and UDAFs
21. Future Directions
▪ Query Optimization
▪ Support for Statistics
▪ These stats are needed to make optimization decisions
▪ Join Optimizations
▪ Map-side joins, semi-join techniques, etc., to make joins faster
▪ Predicate Pushdown Optimizations
▪ Pushing predicates just above the table scan for certain situations in joins as well as ensuring that
only required columns are sent across map/reduce boundaries
▪ Group By Optimizations
▪ Various optimizations to make group by faster
▪ Optimizations to reduce the number of map files created by filter operations
▪ Filters with a large number of mappers produce a lot of files, which slows down the following
operations.
22. Future Directions
▪ MapReduce Integration
▪ Schema-less MapReduce
▪ TRANSFORM needs a schema while MapReduce is schema-less.
▪ Improvements to TRANSFORM
▪ Make this more intuitive to MapReduce developers - evaluate some other keywords, etc.
▪ User Experience
▪ Create a web interface
▪ Error reporting improvements for parse errors
▪ Add “help” command to the CLI
▪ JDBC driver to enable traditional database tools to be used with Hive
23. Future Directions
▪ Integrating Dynamic SerDe with the DDL
▪ This allows the users to create typed tables along with list and map types from the DDL
▪ Transformations in LOAD DATA
▪ LOAD DATA currently does not transform the input data if it is not in the format expected by the
destination table.
▪ Explode and Collect Operators
▪ Explode and collect operators to convert collections to individual items and vice versa.
▪ Propagating sort properties to destination tables
▪ If the query produces sorted output, we want to capture that in the destination table's metadata so that
downstream optimizations can be enabled.
24. (c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0