This document summarizes new features for analyzing HBase data with Apache Hive: querying HBase snapshots, generating HFiles for bulk uploads to HBase, support for composite and timestamp keys, additional improvements, and future work. It gives an overview of Hive and its integration with HBase, describes the new features in detail, and lists the Hive release that includes each feature.
HBaseCon 2015: Analyzing HBase Data with Apache Hive
1. Analyzing HBase Data with
Apache Hive
Swarnim Kulkarni, Cerner Corporation
Nick Dimiduk, Hortonworks
Brock Noland, StreamSets
May 7th, 2015
2. Who are we?
● Nick Dimiduk
o Apache HBase Committer and PMC member
o Co-author of HBase in Action
● Brock Noland
o Apache Hive Committer and PMC member
● Swarnim Kulkarni
o Lead Architect at Cerner Corporation
o Contributor to Apache Hive
3. Agenda
● Apache Hive Basics
● Hive + HBase - Architecture
● Hive + HBase - Features and Improvements
● Future Work
● Q & A
4. Apache Hive
● De facto standard for ad hoc analysis of data in
Hadoop
● SQL-like language called HiveQL for querying data
● Scalable
o SQL queries translate to M/R jobs
● Extensible
o Plug in custom mappers/reducers
o Custom UDFs/UDAFs
o Custom FileFormats/SerDes
6. Hive/HBase Integration
● Brings the best of both worlds together
● Extends the familiar analytical tooling of Hive to
online data stored in HBase
● No need for analysts to write M/R jobs to
analyze the data in HBase
● Uses StorageHandler to access data stored
and managed by HBase
9. Query HBase Snapshots (HIVE-6584)
● Queries over HBase snapshots on HDFS
instead of online Region Servers
● Specify hive.hbase.snapshot.name instead
of hbase.table.name to query the snapshot
● Under the hood:
o Map tasks embed mini-RS, open snapshot regions
o Snapshot restored to a unique directory under /tmp
o Location override: hive.hbase.snapshot.restoredir
10. Query HBase Snapshots (HIVE-6584)
Query without snapshots
hive> CREATE EXTERNAL TABLE store_sales(...) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' ...;
hive> SELECT * FROM store_sales WHERE ss_item_sk > 60010
and ss_ticket_number < 60030;
11. Query HBase Snapshots (HIVE-6584)
Query with snapshots
hbase(main)> snapshot 'store_sales', 'store_sales_snap0'
hive> SET hive.hbase.snapshot.name=store_sales_snap0;
hive> SELECT * FROM store_sales WHERE ss_item_sk > 60010
and ss_ticket_number < 60030;
12. HFile support for bulk HBase uploads (HIVE-6473)
● Create HFiles with HBaseStorageHandler
● Set the following properties:
o set hive.hbase.generatehfiles=true;
o set hfile.family.path=/tmp/columnfamily_name;
● hfile.family.path can also be set as a table
property
13. HFile support for bulk HBase uploads (HIVE-6473)
hive> CREATE EXTERNAL TABLE store_sales(...) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' ...;
hive> SET hive.hbase.generatehfiles=true;
hive> SET hfile.family.path=/tmp/new_store_sales_records/cf;
hive> INSERT OVERWRITE TABLE store_sales SELECT DISTINCT key,
value FROM some_table CLUSTER BY key;
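The CLUSTER BY key clause in the statement above matters because an HFile holds its rows in ascending row-key order, with keys compared byte by byte as unsigned values. The Java sketch below only illustrates that ordering. The class name RowKeyOrderDemo and the sample keys are hypothetical, and the code does not call Hive or HBase.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RowKeyOrderDemo {
    // Compare two row keys byte by byte, treating each byte as unsigned (0..255).
    static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        // When one key is a prefix of the other, the shorter key sorts first.
        return a.length - b.length;
    };

    // Return the keys in unsigned lexicographic byte order (hypothetical helper).
    static List<String> sortedKeys(List<String> keys) {
        List<byte[]> encoded = new ArrayList<>();
        for (String key : keys) {
            encoded.add(key.getBytes(StandardCharsets.UTF_8));
        }
        encoded.sort(UNSIGNED_LEXICOGRAPHIC);
        List<String> sorted = new ArrayList<>();
        for (byte[] key : encoded) {
            sorted.add(new String(key, StandardCharsets.UTF_8));
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Note that "row10" sorts before "row2": the order is by bytes, not by number.
        System.out.println(sortedKeys(Arrays.asList("row10", "row2", "row1")));
    }
}
```

Because the comparison is on bytes, numeric-looking keys do not sort numerically unless they are zero-padded or stored in a fixed-width binary encoding.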
14. Query HBase composite keys (HIVE-2599)
● Support simple and complex
implementations
● Delimiters for delimited composite keys
provided as a part of the DDL
● For complex implementations, custom
implementation of HBaseCompositeKey or
HBaseKeyFactory
15. Query HBase composite keys (HIVE-2599)
hive> CREATE EXTERNAL TABLE hbase_table_1(key
struct<a:string,b:string,c:string>, value string)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,test-family:test-qual")
TBLPROPERTIES ("hbase.table.name" = "SIMPLE_TABLE");
hive> select key.a,key.b,key.c from hbase_table_1;
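To make the delimited case concrete, the Java sketch below shows how a row key written with '~' between its parts lines up with the struct fields a, b, and c in the DDL. It is an illustration only, not Hive's SerDe code. The class name DelimitedKeyDemo, the helper splitKey, and the sample key are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class DelimitedKeyDemo {
    // Split a delimited row key into its parts.
    // The limit of -1 keeps trailing empty parts, so the part count stays stable.
    static List<String> splitKey(String rowKey, char delimiter) {
        String regex = Pattern.quote(String.valueOf(delimiter));
        return Arrays.asList(rowKey.split(regex, -1));
    }

    public static void main(String[] args) {
        // Hypothetical row key with three parts, matching struct<a,b,c>.
        List<String> parts = splitKey("us~east~42", '~');
        System.out.println("key.a=" + parts.get(0)
                + " key.b=" + parts.get(1)
                + " key.c=" + parts.get(2));
    }
}
```

A key whose parts can themselves contain the delimiter cannot be split this way, which is one case where a custom HBaseCompositeKey or HBaseKeyFactory implementation would be needed.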
16. Query HBase composite keys (HIVE-2599)
public class MyCompositeKey extends HBaseCompositeKey {
  /** This is a required constructor **/
  MyCompositeKey(LazySimpleStructObjectInspector oi, Properties tbl, Configuration conf){
    …
  }
  @Override
  public Object getField(int n){
    // override this to return the field at index "n" in the key
  }
}
# Provide this class in the DDL
CREATE EXTERNAL TABLE MyTable(......)TBLPROPERTIES(..,hbase.composite.key.class=MyCompositeKey);
17. Query HBase composite keys (HIVE-2599)
public interface HBaseKeyFactory extends HiveStoragePredicateHandler {
  /** Initialize factory with properties */
  void init(HBaseSerDeParameters hbaseParam, Properties properties) throws SerDeException;
  /** Create custom object inspector for hbase key */
  ObjectInspector createKeyObjectInspector(TypeInfo type) throws SerDeException;
  /** Create custom object for hbase key */
  LazyObjectBase createKey(ObjectInspector inspector) throws SerDeException;
  /** Serialize hive object in internal format of custom key */
  byte[] serializeKey(Object object, StructField field) throws IOException;
}
# Provide the implementation in the DDL
CREATE EXTERNAL TABLE MyTable(......)TBLPROPERTIES(..,hbase.composite.key.factory=MyCompositeKeyFactory);
18. Query HBase timestamps (HIVE-2828)
● First-class support for querying HBase
timestamps
● Use the special :timestamp entry to read the
cell timestamps
● Specified as part of
hbase.columns.mapping
19. Query HBase timestamps (HIVE-2828)
hive> CREATE TABLE hbase_table (key string, value
string, time timestamp)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf:string,:timestamp");
hive> SELECT key, value, cast(time as timestamp)
FROM hbase_table WHERE key > 100 AND key < 400 AND
time < 200000000000;
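HBase cell timestamps are, by default, milliseconds since the Unix epoch. The Java sketch below assumes the numeric bound in the query above is read the same way and shows which instant it corresponds to. The slides do not state how Hive compares the literal with the timestamp column, so treat that reading as an assumption. The class name HBaseTimestampDemo and its helpers are hypothetical.

```java
import java.time.Instant;

class HBaseTimestampDemo {
    // Interpret a cell timestamp as milliseconds since the Unix epoch (assumed unit).
    static Instant toInstant(long cellTimestampMillis) {
        return Instant.ofEpochMilli(cellTimestampMillis);
    }

    // True when the cell timestamp falls strictly before the bound, as in "time < bound".
    static boolean isBefore(long cellTimestampMillis, long boundMillis) {
        return cellTimestampMillis < boundMillis;
    }

    public static void main(String[] args) {
        long bound = 200000000000L; // the literal from the query, read as epoch milliseconds
        System.out.println(toInstant(bound)); // prints 1976-05-03T19:33:20Z
        System.out.println(isBefore(199999999999L, bound)); // prints true
    }
}
```

Under that assumption the predicate only matches cells written before May 1976, so a real query would normally use a much larger bound.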
20. Additional Improvements
● Support for querying Avro structs stored in
HBase (HIVE-6147); no serialization
capability yet (HIVE-8020)
● Support for pulling HBase columns with
wildcards (HIVE-3725)
● Multiple bug fixes and performance
enhancements
21. Coming to a Hive Release Near You!
● Query HBase Snapshots, 0.14.0
● HFile support for bulk HBase uploads, 0.14.0
● Query HBase composite keys, 0.13.0
● Query HBase timestamps, 1.1.0
● Support for pulling HBase columns with
wildcards, 0.12.0
22. Future Work
● Tighter integration with Phoenix
● Stronger support for salted HBase keys
(HIVE-7128)
● Support for HBase DataType API
(HIVE-6150)
● Improved HBase bulk load facility
(HIVE-4765)