20081030linkedin

An Introduction to Hive:
Components and Query Language

Jeff Hammerbacher
Chief Scientist and VP of Product
October 30, 2008

Hive Components
A Leaky Database
▪ Hadoop

▪ HDFS

▪ MapReduce (bundles Resource Manager and Job Scheduler)
▪ Hive

▪ Logical data partitioning
▪ Metastore (command line and web interfaces)
▪ Query Language
▪ Libraries to handle different serialization formats (SerDes)
▪ JDBC interface

Related Work
Glaringly Incomplete
▪ Gamma, Bubba, Volcano, etc.
▪ Google: Sawzall
▪ Yahoo: Pig
▪ IBM Research: JAQL
▪ Microsoft: SCOPE
▪ Greenplum: YAML MapReduce
▪ Aster Data: In-Database MapReduce
▪ Business.com: CloudBase

Hive Resources
▪ Facebook Mirror: http://mirror.facebook.com/facebook/hive
▪ Currently the best place to get the Hive distribution

▪ Wiki page: http://wiki.apache.org/hadoop/Hive
▪ Getting started: http://wiki.apache.org/hadoop/Hive/GettingStarted
▪ Query language reference: http://wiki.apache.org/hadoop/Hive/HiveQL
▪ Presentations: http://wiki.apache.org/hadoop/Hive/Presentations
▪ Roadmap: http://wiki.apache.org/hadoop/Hive/Roadmap

▪ Mailing list: hive-users@publists.facebook.com

▪ JIRA: https://issues.apache.org/jira/browse/HADOOP/component/12312455

Running Hive
Quickstart
▪ <install Hadoop>
▪ wget http://mirror.facebook.com/facebook/hive/hadoop-0.19/dist.tar.gz
▪ (Replace 0.19 with 0.17 if you’re still on 0.17)
▪ tar xvzf dist.tar.gz
▪ cd dist
▪ export HADOOP=<path to bin/hadoop in your Hadoop distribution>
▪ Or: edit hadoop.bin.path and hadoop.config.dir in conf/hive-default.xml
▪ bin/hive

▪ hive>

Running Hive
Configuration Details
▪ conf/hive-default.xml

▪ hadoop.bin.path: Points to bin/hadoop in your Hadoop installation
▪ hadoop.config.dir: Points to conf/ in your Hadoop installation
▪ hive.exec.scratchdir: HDFS directory where execution information is written
▪ hive.metastore.warehouse.dir: HDFS directory managed by Hive
▪ The rest of the properties relate to the Metastore
▪ conf/hive-log4j.properties

▪ Writes log output to /tmp/{user.name}/hive.log by default
▪ conf/jpox.properties

▪ JPOX is a Java object persistence library used by the Metastore

Populating Hive
MovieLens Data
▪ <cd into your hive directory>
▪ wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
▪ tar xvzf ml-data.tar__0.gz
▪ CREATE TABLE u_data (userid INT, movieid INT, rating INT, unixtime TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
▪ The first query can take ten seconds or more, as the Metastore needs to be created
▪ To confirm our table has been created:
▪ SHOW TABLES;
▪ DESCRIBE u_data;
▪ LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;
▪ SELECT COUNT(1) FROM u_data;
▪ Should fire off 2 MapReduce jobs and ultimately return a count of 100,000

Hive Query Language
Utility Statements
▪ SHOW TABLES [table_name | table_name_pattern]

▪ DESCRIBE [EXTENDED] table_name
[PARTITION (partition_col = partition_col_value, ...)]

▪ EXPLAIN [EXTENDED] query_statement

▪ SET [EXTENDED]

▪ “SET property_name=property_value” to modify a value

Hive Query Language
CREATE TABLE Syntax
▪ CREATE [EXTERNAL] TABLE table_name (col_name data_type [col_comment], ...)
[PARTITIONED BY (col_name data_type [col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

▪ PARTITION columns are virtual columns; they are not part of the data itself but are derived on load
▪ CLUSTERED columns are real columns, hash partitioned into num_buckets folders
▪ ROW FORMAT can be used to specify a delimited data set or a custom deserializer
▪ Use EXTERNAL with ROW FORMAT, STORED AS, and LOCATION to analyze HDFS files in place
▪ “DROP TABLE table_name” can reverse this operation
▪ NB: Currently, DROP TABLE will delete both data and metadata
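
To make the difference between PARTITION and CLUSTERED columns concrete, here is a rough Python sketch of where a row might land. It assumes a key=value directory layout for partitions under the warehouse directory. The bucket function uses zlib.crc32 as a stand-in and is not Hive's actual hash. The names partition_dir and bucket_for, the table page_views, and the warehouse path are hypothetical.

```python
import zlib


def partition_dir(warehouse, table, partition_spec):
    """Build the directory for a partition from its virtual column values."""
    parts = ["%s=%s" % (col, val) for col, val in partition_spec]
    return "/".join([warehouse.rstrip("/"), table] + parts)


def bucket_for(value, num_buckets):
    """Assign a real column value to one of num_buckets buckets by hashing.

    crc32 is only a stand-in here; Hive's own hash function differs.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical table partitioned by a date column "ds".
print(partition_dir("/user/hive/warehouse", "page_views", [("ds", "2008-10-30")]))
# The bucket number depends only on the column value and num_buckets.
print(bucket_for(196, 32))
```

The partition value appears only in the path, which is why it does not need to be stored in the data files, while the bucket is computed from a column that is present in every row.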

Hive Query Language
CREATE TABLE Syntax, Part Two
▪ data_type: primitive_type | array_type | map_type
▪ primitive_type:
▪ TINYINT | INT | BIGINT | BOOLEAN | FLOAT | DOUBLE | STRING
▪ DATE | DATETIME | TIMESTAMP
▪ array_type: ARRAY < primitive_type >
▪ map_type: MAP < primitive_type, primitive_type >
▪ row_format:
▪ DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
▪ SERIALIZER serde_name [WITH PROPERTIES property_name=property_value,
property_name=property_value, ...]
▪ file_format: SEQUENCEFILE | TEXTFILE
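
As an illustration of what a DELIMITED row_format describes, the following Python sketch splits one line into a primitive field, an array, and a map. The delimiters chosen here (',' for fields, '|' for collection items, ':' for map keys) and the function name parse_row are assumptions for the example, not Hive defaults.

```python
def parse_row(line, field_sep=",", item_sep="|", key_sep=":"):
    """Parse a line of the form name,<array>,<map> into Python values.

    Array items and map entries are both separated by item_sep;
    map keys are separated from their values by key_sep.
    """
    name, tags, attrs = line.rstrip("\n").split(field_sep)
    tag_list = tags.split(item_sep) if tags else []
    attr_map = {}
    if attrs:
        attr_map = dict(item.split(key_sep, 1) for item in attrs.split(item_sep))
    return name, tag_list, attr_map


print(parse_row("alice,red|green,age:30|city:SF"))
```

All values come back as strings; converting them to the declared column types is the job of the deserializer.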

Hive Query Language
ALTER TABLE Syntax
▪ ALTER TABLE table_name RENAME TO new_table_name;
▪ ALTER TABLE table_name ADD COLUMNS (col_name data_type [col_comment], ...);
▪ ALTER TABLE table_name DROP partition_spec, partition_spec, ...;

▪ Future work:
▪ Support for removing or renaming columns
▪ Support for altering serialization format

Hive Query Language
LOAD DATA Syntax
▪ LOAD DATA [LOCAL] INPATH '/path/to/file'
[OVERWRITE] INTO TABLE table_name
[PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)]

▪ You can load data from the local filesystem or anywhere in HDFS (cf. CREATE TABLE EXTERNAL)

▪ If you don’t specify OVERWRITE, data will be appended to the existing table

Hive Query Language
SELECT Syntax
▪ [insert_clause]
SELECT [ALL|DISTINCT] select_list
FROM [table_source|join_source]
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list]

▪ insert_clause: INSERT OVERWRITE destination

▪ destination:

▪ LOCAL DIRECTORY '/local/path'
▪ DIRECTORY '/hdfs/path'
▪ TABLE table_name [PARTITION (partition_col = partition_col_value, ...)]

Hive Query Language
SELECT Syntax
▪ join_source: table_source join_clause table_source join_clause table_source ...

▪ join_clause

▪ [LEFT OUTER|RIGHT OUTER|FULL OUTER] JOIN ON (equality_expression, equality_expression, ...)

▪ Currently, only equi-joins (inner or outer) are supported in Hive.

▪ There are two join algorithms

▪ Map-side merge join

▪ Reduce-side merge join
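
A reduce-side join can be pictured as follows: the map phase tags each row with its source table and emits it under the join key, and the reduce phase pairs up rows that share a key. The Python sketch below simulates this in memory for an inner equi-join. It only illustrates the general technique and is not Hive's implementation; all names are hypothetical.

```python
from collections import defaultdict


def reduce_side_join(left_rows, right_rows, left_key, right_key):
    """Inner equi-join of two lists of dicts, simulating map, shuffle, reduce."""
    # Map + shuffle: group rows by join key, tagged by source ("L" or "R").
    groups = defaultdict(lambda: {"L": [], "R": []})
    for row in left_rows:
        groups[row[left_key]]["L"].append(row)
    for row in right_rows:
        groups[row[right_key]]["R"].append(row)

    # Reduce: for each key, emit every left/right pairing as one merged row.
    joined = []
    for key in sorted(groups):
        for left in groups[key]["L"]:
            for right in groups[key]["R"]:
                merged = dict(left)
                merged.update(right)
                joined.append(merged)
    return joined


users = [{"userid": 1, "name": "a"}, {"userid": 2, "name": "b"}]
ratings = [{"userid": 1, "rating": 5}, {"userid": 3, "rating": 4}]
print(reduce_side_join(users, ratings, "userid", "userid"))
```

Keys that appear on only one side produce no output here; an outer join would instead emit those rows padded with NULLs.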

Hive Query Language
Building a Histogram of Review Counts
▪ CREATE TABLE review_counts (userid INT, review_count INT);
▪ INSERT OVERWRITE TABLE review_counts
SELECT a.userid, COUNT(1) AS review_count
FROM u_data a
GROUP BY a.userid;
▪ SELECT b.review_count, COUNT(1)
FROM review_counts b
GROUP BY b.review_count;
▪ Notes:
▪ No INSERT OVERWRITE for second query means output is dumped to the shell
▪ Hive does not currently support CREATE TABLE AS
▪ We have to create the table and then INSERT into it
▪ Hive does not currently support subqueries
▪ We have to write two queries
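
For readers who want to check the logic on a small sample, the same two-step aggregation can be sketched in plain Python. The function name review_histogram and the sample tuples are made up for illustration.

```python
from collections import Counter


def review_histogram(ratings):
    """ratings: iterable of (userid, movieid, rating, unixtime) tuples."""
    # Step 1: reviews per user (what the review_counts table holds).
    review_counts = Counter(userid for userid, _, _, _ in ratings)
    # Step 2: number of users for each review count (the final SELECT).
    return Counter(review_counts.values())


sample = [(1, 10, 5, 0), (1, 11, 4, 0), (2, 10, 3, 0), (3, 12, 2, 0), (3, 13, 1, 0)]
print(dict(review_histogram(sample)))
```

Each Counter plays the role of one GROUP BY, which mirrors why the Hive version needs two queries.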

Hive Query Language
Running Custom MapReduce
▪ Put the following into weekday_mapper.py:
▪ import sys
import datetime

# Read tab-separated ratings from stdin and replace unixtime with the ISO weekday
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print(','.join([userid, movieid, rating, str(weekday)]))
▪ CREATE TABLE u_data_new (userid INT, movieid INT, rating INT, weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
▪ FROM u_data a
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (a.userid, a.movieid, a.rating, a.unixtime)
AS (userid, movieid, rating, weekday)
USING 'python /full/path/to/weekday_mapper.py';
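
Before running the script through TRANSFORM, it can help to exercise the per-line logic on its own. The sketch below (Python 3) wraps it in a hypothetical function add_weekday and pins the time zone to UTC so the result is reproducible. The mapper script itself uses the local time zone of the machine it runs on, so weekdays near midnight can differ.

```python
import datetime


def add_weekday(line):
    """Replace the unixtime field of a tab-separated rating with its ISO weekday."""
    userid, movieid, rating, unixtime = line.strip().split("\t")
    moment = datetime.datetime.fromtimestamp(
        float(unixtime), tz=datetime.timezone.utc
    )
    # isoweekday(): Monday is 1, Sunday is 7.
    return ",".join([userid, movieid, rating, str(moment.isoweekday())])


# Timestamp 0 is 1970-01-01 UTC, which was a Thursday.
print(add_weekday("1\t2\t3\t0"))
```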

Hive Query Language
Programmatic Access
▪ The Hive shell can take a file with queries to be executed
▪ bin/hive -f /path/to/query/file

▪ You can also run a Hive query straight from the command line
▪ bin/hive -e 'quoted query string'

▪ A simple JDBC interface is available for experimentation as well
▪ https://issues.apache.org/jira/browse/HADOOP-4101
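
The -e and -f options make it straightforward to drive Hive from a script. A minimal Python sketch, assuming it is run from the Hive directory so that bin/hive resolves; hive_command and run_hive are hypothetical helper names, not part of Hive.

```python
import subprocess


def hive_command(query=None, query_file=None, hive_bin="bin/hive"):
    """Build the argument list for a non-interactive Hive shell invocation."""
    if (query is None) == (query_file is None):
        raise ValueError("pass exactly one of query or query_file")
    if query is not None:
        return [hive_bin, "-e", query]
    return [hive_bin, "-f", query_file]


def run_hive(**kwargs):
    """Run Hive and return its standard output as text (needs a Hive install)."""
    return subprocess.check_output(hive_command(**kwargs)).decode("utf-8")


print(hive_command(query="SELECT COUNT(1) FROM u_data;"))
```

Passing the arguments as a list avoids shell quoting problems with the query string.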

Hive Components
Metastore
▪ Currently uses an embedded Derby database for persistence
▪ While Derby is in place, you’ll need to put it into Server Mode to
have more than one concurrent Hive user
▪ See http://wiki.apache.org/hadoop/HiveDerbyServerMode
▪ Next release will use MySQL as default persistent data store
▪ The goal is to have the persistent store be pluggable
▪ You can view the Thrift IDL for the metastore online
▪ https://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/hive/metastore/if/hive_metastore.thrift

Hive Components
Query Processing
▪ Compiler

▪ Parser

▪ Type Checking
▪ Semantic Analysis
▪ Plan Generation
▪ Task Generation
▪ Execution Engine
▪ Plan

▪ Operators

▪ UDFs and UDAFs

Future Directions
▪ Query Optimization
▪ Support for Statistics
▪ These stats are needed to make optimization decisions
▪ Join Optimizations
▪ Map-side joins, semi-join techniques, etc., to make joins faster
▪ Predicate Pushdown Optimizations
▪ Pushing predicates just above the table scan for certain situations in joins as well as ensuring that
only required columns are sent across map/reduce boundaries
▪ Group By Optimizations
▪ Various optimizations to make group by faster
▪ Optimizations to reduce the number of map files created by filter operations
▪ Filters with a large number of mappers produce a lot of files, which slows down the following
operations.

Future Directions
▪ MapReduce Integration
▪ Schema-less MapReduce
▪ TRANSFORM needs a schema while MapReduce is schema-less.
▪ Improvements to TRANSFORM
▪ Make this more intuitive to MapReduce developers - evaluate some other keywords, etc.

▪ User Experience
▪ Create a web interface
▪ Error reporting improvements for parse errors
▪ Add “help” command to the CLI
▪ JDBC driver to enable traditional database tools to be used with Hive

Future Directions
▪ Integrating Dynamic SerDe with the DDL
▪ This allows the users to create typed tables along with list and map types from the DDL

▪ Transformations in LOAD DATA
▪ LOAD DATA currently does not transform the input data if it is not in the format expected by the
destination table.

▪ Explode and Collect Operators
▪ Explode and collect operators to convert collections to individual items and vice versa.

▪ Propagating sort properties to destination tables
▪ If the query produces sorted output, we want to capture that in the destination table's metadata so that
downstream optimizations can be enabled.
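
The explode and collect operators mentioned above can be pictured with a small Python sketch. This shows only the intended semantics on in-memory rows, not a proposed Hive implementation, and the function names are illustrative.

```python
from collections import defaultdict


def explode(rows):
    """Turn (key, [items]) rows into one (key, item) row per item."""
    return [(key, item) for key, items in rows for item in items]


def collect(rows):
    """Inverse of explode: gather (key, item) rows into (key, [items]) rows."""
    grouped = defaultdict(list)
    for key, item in rows:
        grouped[key].append(item)
    return sorted(grouped.items())


rows = [("a", [1, 2]), ("b", [3])]
flat = explode(rows)
print(flat)
print(collect(flat))
```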
