This is the hands-on-lab document I created accompanying my presentation at the Information On Demand 2013 conference for Session Number 1687 - Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL.
*Contact me for data files*
This lab has 3 independent parts:
Part I - Creating Big SQL Tables and Loading Data
(exploring different ways to create and load HBase tables with Big SQL; includes an optional section on HBase access via JAQL)
Part II - Query Handling
(how to query HBase tables with Big SQL)
Part III - Connecting to Big SQL Server via JDBC
(using BIRT, a business intelligence and reporting tool, to run a simple report on a TPC-H orders table, showcasing use of the Big SQL JDBC driver)
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Session Number 1687
Piotr Pruski, IBM, piotr.pruski@ca.ibm.com (@ppruski)
Benjamin Leonhardi, IBM
Table of Contents

Lab Setup
Getting Started
Administering the Big SQL and HBase Servers
Part I – Creating Big SQL Tables and Loading Data
    Background
    One-to-one Mapping
        Adding New JDBC Drivers
    One-to-one Mapping with UNIQUE Clause
    Many-to-one Mapping (Composite Keys and Dense Columns)
        Why do we need many-to-one mapping?
        Data Collation Problem
    Many-to-one Mapping with Binary Encoding
    Many-to-one Mapping with HBase Pre-created Regions and External Tables
    Load Data: Error Handling
    [OPTIONAL] HBase Access via JAQL
PART II – A – Query Handling
    The Data
    Projection Pushdown
    Predicate Pushdown
        Point Scan
        Partial Row Scan
        Range Scan
        Full Table Scan
    Automatic Index Usage
    Pushing Down Filters into HBase
    Table Access Hints
        Accessmode
PART II – B – Connecting to Big SQL Server via JDBC
    Business Intelligence and Reporting via BIRT
Communities
Thank You!
Acknowledgements and Disclaimers
Lab Setup
This lab exercise uses the IBM InfoSphere BigInsights Quick Start Edition, v2.1. The Quick
Start Edition uses a non-warranted program license, and is not for production use.
The purpose of the Quick Start Edition is for experimenting with the features of InfoSphere
BigInsights, while being able to use real data and run real applications. The Quick Start
Edition puts no data limit on the cluster and there is no time limit on the license.
The following table outlines the users and passwords that are pre-configured on the image:
username
root
biadmin
db2inst1
password
password
biadmin
password
Getting Started
To prepare for the contents of this lab, you must go through the following process to start all
of the Hadoop components.
1. Start the VMware image by clicking the “Power on this virtual machine” button in
VMware Workstation if the VM is not already on.
2. Log into the VMware virtual machine using the following information
user: biadmin
password: biadmin
3. Double-click on the BigInsights Shell folder icon from the desktop of the Quick Start
VM. This view provides you with quick links to access the following functions that will be
used throughout the course of this exercise:
Big SQL Shell
HBase Shell
Jaql Shell
Linux gnome-terminal
4. Open the Terminal (gnome-terminal) and start the Hadoop components (daemons).
Linux Terminal
start-all.sh
Note: This command may take a few minutes to finish.
Once all components have started successfully as shown below you may move to the next
section.
…
[INFO] Progress - 100%
[INFO] DeployManager - Start; SUCCEEDED components: [zookeeper, hadoop, derby, hive,
hbase, bigsql, oozie, orchestrator, console, httpfs]; Consumes : 174625ms
Administering the Big SQL and HBase Servers
BigInsights provides both command-line tools and a user interface to manage the Big SQL
and HBase servers. In this section, we will briefly go over the user interface which is part of
BigInsights Web Console.
1. Bring up the BigInsights web console by double clicking on the BigInsights
WebConsole icon on the desktop of the VM and open the Cluster Status tab. Select
HBase to view the status of HBase master and region servers.
2. Similarly, click on Big SQL from the same tab to view its status.
3. Use the hbase-master and hbase-regionserver web interfaces to visualize tables, regions
and other metrics. Go to the BigInsights Welcome tab and select “Access Secure
Cluster Servers.” You may need to enable pop-ups from the site when prompted.
Alternatively, point your browser to the bottom two URLs noted in the image below.
Some interesting information available from the web interfaces:
• HBase root directory. This can be used to find the size of an HBase table.
• List of tables with descriptions.
• Lists of regions, with start and end keys, for each table. This information can be
  used to compact or split tables as needed.
• Metrics for each region server. These can be used to determine if there are hot
  regions which are serving the majority of requests to a table; such regions can be
  split. The metrics also help determine the effects and effectiveness of block cache,
  bloom filters, and memory settings.
4. Perform a health check of HBase and Big SQL. This differs from the status checks
done above: it verifies that the services are actually functioning. From the Linux
gnome-terminal, issue the following commands.
Linux Terminal
$BIGINSIGHTS_HOME/bin/healthcheck.sh hbase
[INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ]
[INFO] Progress - Health check hbase
[INFO] Deployer - Try to start hbase if hbase service is stopped...
[INFO] Deployer - Double check whether hbase is started successfully...
[INFO] @bivm - hbase-master(active) started, pid 6627
[INFO] @bivm - hbase-regionserver started, pid 6745
[INFO] Deployer - hbase service started
[INFO] Deployer - hbase service is healthy
[INFO] Progress - 100%
[INFO] DeployManager - Health check; SUCCEEDED components: [hbase]; Consumes :
26335ms
Linux Terminal
$BIGINSIGHTS_HOME/bin/healthcheck.sh bigsql
[INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ]
[INFO] Progress - Health check bigsql
[INFO] @bivm - bigsql-server already running, pid 6949
[INFO] Deployer - Ping Check Success: bivm/192.168.230.137:7052
[INFO] @bivm - bigsql is healthy
[INFO] Progress - 100%
[INFO] DeployManager - Health check; SUCCEEDED components: [bigsql]; Consumes : 1121ms
Part I – Creating Big SQL Tables and Loading Data
In this part of the lab, our main goal is to demonstrate the migration of a table from a
relational database to BigInsights using Big SQL over HBase. We will see how HBase handles
row keys and some pitfalls that users may encounter when moving data from a relational
database to HBase tables. We will also try some useful options, such as pre-creating
regions, to see how they can help with data loading and queries. Finally, we will explore
various ways to load data.
Background
In this lab, we will use one table from the Great Outdoors Sales Data Warehouse model
(GOSALESDW), SLS_SALES_FACT.
The details of the table, along with its primary key information, are depicted in the
figure below.
SLS_SALES_FACT
PK  ORDER_DAY_KEY
PK  ORGANIZATION_KEY
PK  EMPLOYEE_KEY
PK  RETAILER_KEY
PK  RETAILER_SITE_KEY
PK  PROMOTION_KEY
PK  ORDER_METHOD_KEY
    SALES_ORDER_KEY
    SHIP_DAY_KEY
    CLOSE_DAY_KEY
    QUANTITY
    UNIT_COST
    UNIT_PRICE
    UNIT_SALE_PRICE
    GROSS_MARGIN
    SALE_TOTAL
    GROSS_PROFIT
This image contains a DB2 instance with this table already loaded with data, which we will
use as the source of our migration.
From the Linux gnome-terminal, switch to the DB2 instance user as shown below.
Linux Terminal
su - db2inst1
Note: The password for the db2inst1 user is password. Enter this when prompted.
As db2inst1, connect to the pre-created database, gosales.
Linux Terminal
db2 CONNECT TO gosales
Upon successful connection, you should see the following output on the terminal.
Database Connection Information

Database server       = DB2/LINUXX8664 10.5.0
SQL authorization ID  = DB2INST1
Local database alias  = GOSALES
Issue the following command to list all of the tables contained in this database.
Linux Terminal
db2 LIST TABLES
Note: Here you will see three tables. Each one is essentially the same except with one key difference –
the amount of data that is contained within them. The remaining instructions in this lab exercise will use the
SLS_SALES_FACT_10P table simply for the fact that it has a smaller amount of data and will be faster to
work with for demonstration purposes. If you would like to use the larger tables with more data feel free to
do so but just remember to change the names appropriately.
Table/View                      Schema          Type  Creation time
------------------------------- --------------- ----- --------------------------
SLS_SALES_FACT                  DB2INST1        T     2013-08-22-14.51.27.228148
SLS_SALES_FACT_10P              DB2INST1        T     2013-08-22-14.54.01.622569
SLS_SALES_FACT_25P              DB2INST1        T     2013-08-22-14.55.46.416787

3 record(s) selected.
Examine how many rows this table contains so that we can later confirm everything was
migrated properly. Issue the following select statement.
Linux Terminal
db2 "SELECT COUNT(*) FROM sls_sales_fact_10p"
You should expect 44603 rows in this table.
1
----------44603
1 record(s) selected.
Use the following describe command to view all of the columns and data types that are
contained within this table.
Linux Terminal
db2 "DESCRIBE TABLE sls_sales_fact_10p"
Column name                    Data type schema  Data type name  Length  Scale  Nulls
------------------------------ ----------------- --------------- ------- ------ -----
ORDER_DAY_KEY                  SYSIBM            INTEGER         4       0      Yes
ORGANIZATION_KEY               SYSIBM            INTEGER         4       0      Yes
EMPLOYEE_KEY                   SYSIBM            INTEGER         4       0      Yes
RETAILER_KEY                   SYSIBM            INTEGER         4       0      Yes
RETAILER_SITE_KEY              SYSIBM            INTEGER         4       0      Yes
PRODUCT_KEY                    SYSIBM            INTEGER         4       0      Yes
PROMOTION_KEY                  SYSIBM            INTEGER         4       0      Yes
ORDER_METHOD_KEY               SYSIBM            INTEGER         4       0      Yes
SALES_ORDER_KEY                SYSIBM            INTEGER         4       0      Yes
SHIP_DAY_KEY                   SYSIBM            INTEGER         4       0      Yes
CLOSE_DAY_KEY                  SYSIBM            INTEGER         4       0      Yes
QUANTITY                       SYSIBM            INTEGER         4       0      Yes
UNIT_COST                      SYSIBM            DECIMAL         19      2      Yes
UNIT_PRICE                     SYSIBM            DECIMAL         19      2      Yes
UNIT_SALE_PRICE                SYSIBM            DECIMAL         19      2      Yes
GROSS_MARGIN                   SYSIBM            DOUBLE          8       0      Yes
SALE_TOTAL                     SYSIBM            DECIMAL         19      2      Yes
GROSS_PROFIT                   SYSIBM            DECIMAL         19      2      Yes

18 record(s) selected.
One-to-one Mapping
In this section, we will use Big SQL to do a one-to-one mapping of the columns in the
relational DB2 table to an HBase table row key and columns. This is not a recommended
approach; however, the goal of this exercise is to demonstrate the inefficiency and pitfalls
that can occur with such a mapping.
Big SQL supports both one-to-one and many-to-one mappings.
In a one-to-one mapping, the HBase row key and each HBase column are mapped to a single
SQL column. For example, an HBase row key might be mapped to a SQL column id, a cq_name
column within the cf_data column family to a SQL column name, and so on.
To begin, create a schema to keep our tables organized. Open the BigSQL Shell from
the BigInsights Shell folder on the desktop and use the create schema command to create a
schema named gosalesdw.
BigSQL Shell
CREATE SCHEMA gosalesdw;
Issue the following command in the same BigSQL shell that is open. This DDL statement will
create the SQL table with the one-to-one mapping of what we have in our relational DB2
source. Notice all the column names are the same with the same data types. The column
mapping section requires a mapping for the row key. HBase columns are identified using
family:qualifier.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT
(
  ORDER_DAY_KEY     int,
  ORGANIZATION_KEY  int,
  EMPLOYEE_KEY      int,
  RETAILER_KEY      int,
  RETAILER_SITE_KEY int,
  PRODUCT_KEY       int,
  PROMOTION_KEY     int,
  ORDER_METHOD_KEY  int,
  SALES_ORDER_KEY   int,
  SHIP_DAY_KEY      int,
  CLOSE_DAY_KEY     int,
  QUANTITY          int,
  UNIT_COST         decimal(19,2),
  UNIT_PRICE        decimal(19,2),
  UNIT_SALE_PRICE   decimal(19,2),
  GROSS_MARGIN      double,
  SALE_TOTAL        decimal(19,2),
  GROSS_PROFIT      decimal(19,2)
)
COLUMN MAPPING
(
  key                          mapped by (ORDER_DAY_KEY),
  cf_data:cq_ORGANIZATION_KEY  mapped by (ORGANIZATION_KEY),
  cf_data:cq_EMPLOYEE_KEY      mapped by (EMPLOYEE_KEY),
  cf_data:cq_RETAILER_KEY      mapped by (RETAILER_KEY),
  cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
  cf_data:cq_PRODUCT_KEY       mapped by (PRODUCT_KEY),
  cf_data:cq_PROMOTION_KEY     mapped by (PROMOTION_KEY),
  cf_data:cq_ORDER_METHOD_KEY  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_SALES_ORDER_KEY   mapped by (SALES_ORDER_KEY),
  cf_data:cq_SHIP_DAY_KEY      mapped by (SHIP_DAY_KEY),
  cf_data:cq_CLOSE_DAY_KEY     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY          mapped by (QUANTITY),
  cf_data:cq_UNIT_COST         mapped by (UNIT_COST),
  cf_data:cq_UNIT_PRICE        mapped by (UNIT_PRICE),
  cf_data:cq_UNIT_SALE_PRICE   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_GROSS_MARGIN      mapped by (GROSS_MARGIN),
  cf_data:cq_SALE_TOTAL        mapped by (SALE_TOTAL),
  cf_data:cq_GROSS_PROFIT      mapped by (GROSS_PROFIT)
);
Big SQL supports a load from source command that can be used to load data from
warehouse sources which we’ll use first. It also supports loading data from delimited files
using a load hbase command which we will use later.
Adding New JDBC Drivers
The load from source command uses Sqoop internally to do the load. Therefore, before
using the load command from a BigSQL shell, we first need to add the driver for the JDBC
source into 1) the Sqoop library directory, and 2) the JSQSH terminal shared directory.
From a Linux gnome-terminal, issue the following command (as biadmin) to copy the JDBC
driver JAR file needed to access the database into the $SQOOP_HOME/lib directory.
Linux Terminal
cp /opt/ibm/db2/V10.5/java/db2jcc.jar $SQOOP_HOME/lib
From the BigSQL shell, examine the drivers currently loaded for the JSQSH terminal.
BigSQL Shell
drivers
Terminate the BigSQL shell with the quit command.
BigSQL Shell
quit
Copy the same DB2 driver to the JSQSH share directory with the following command.
Linux Terminal
cp /opt/ibm/db2/V10.5/java/db2jcc.jar
$BIGINSIGHTS_HOME/bigsql/jsqsh/share/
When a user adds new drivers, the Big SQL server must be restarted. You can do this
either from the web console, or with the following command from the Linux gnome-terminal.
Linux Terminal
stop.sh bigsql && start.sh bigsql
Open the BigSQL Shell from the BigInsights Shell folder on the desktop once again (it was
closed earlier with the quit command) and check that the driver was in fact loaded
into JSQSH.
BigSQL Shell
drivers
Now that the drivers have been set, the load can finally take place. The load from
source statement extracts data from a source outside of an InfoSphere BigInsights cluster
(DB2 in this case) and loads that data into an InfoSphere BigInsights HBase (or Hive) table.
Issue the following command to load the SLS_SALES_FACT_10P table from DB2 into the
SLS_SALES_FACT table we have defined in BigSQL.
BigSQL Shell
LOAD USING JDBC CONNECTION URL 'jdbc:db2://localhost:50000/GOSALES'
WITH PARAMETERS (user = 'db2inst1',password = 'password') FROM TABLE
SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY INTO HBASE TABLE
gosalesdw.sls_sales_fact APPEND;
You should expect to load 44603 rows which is the same number of rows that the select
count statement on the original DB2 table verified earlier.
44603 rows affected (total: 1m37.74s)
Try to verify this with a select count statement as shown.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact;
Notice there is a discrepancy between the results from the load operation and the select
count statement.
+----+
|    |
+----+
| 33 |
+----+
1 row in results(first row: 3.13s; total: 3.13s)
Also verify from an HBase shell. Open the HBase Shell from the BigInsights Shell folder on
desktop and issue the following count command to verify the number of rows.
HBase Shell
count 'gosalesdw.sls_sales_fact'
It should be apparent that the results from the Big SQL statement and HBase commands
conform to one another.
33 row(s) in 0.7000 seconds
However, this doesn’t yet explain why there is a mismatch between the number of loaded
rows and the number of retrieved rows when we query the table.
The load (and insert, to be examined later) command behaves like an upsert: if a
row with the same row key exists, HBase writes the new value as a new version for that
column/cell. When querying the table, only the latest value is returned by Big SQL.
In many cases, this behaviour can be confusing. In our case, we loaded data with
repeating row-key values from a DB2 table with 44603 rows, and the load reported 44603
rows affected. However, select count(*) showed far fewer rows: 33, to be exact. No
errors are thrown in such scenarios, so it is always recommended to cross-check the
number of rows by querying the table, as we did.
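The upsert behaviour can be sketched with a toy model (plain Python, purely illustrative; this is not the HBase client API):

```python
from collections import defaultdict

# Toy model of HBase cell versioning: each row key maps to a list of
# timestamped versions, newest appended last.
table = defaultdict(list)

def put(row_key, value):
    # An "upsert": a repeated row key simply gains another version.
    table[row_key].append(value)

def scan_latest():
    # A query (like Big SQL's SELECT) sees only the newest version per key.
    return {key: versions[-1] for key, versions in table.items()}

# Loading three records that share one row key reports three rows affected...
for value in ("a", "b", "c"):
    put("20070720", value)

print(sum(len(v) for v in table.values()))  # 3 versions stored
print(len(scan_latest()))                   # ...but a query sees only 1 row
```

This is the 44603-versus-33 mismatch seen above, scaled down: every load is counted as a row affected, but versions of the same key collapse into one row at query time.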
Now that we understand that all the rows are actually versioned in HBase, we can examine a
possible way to retrieve all versions of a particular row.
First, from the BigSQL shell, issue the following select query with a predicate on the order
day key. In the original table, there are most likely many tuples with the same order day key.
BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact WHERE
order_day_key = 20070720;
As expected, we only retrieve one row, which is the latest or newest version of the row
inserted into HBase with the specified order day key.
+------------------+
| organization_key |
+------------------+
|            11171 |
+------------------+
1 row in results
Using the HBase shell, we can retrieve previous versions for a row key. Use the following
get command to get the top 4 versions of the row with row key 20070720.
HBase Shell
get 'gosalesdw.sls_sales_fact', '20070720', {COLUMN =>
'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
Since the previous command specified only 4 versions (VERSIONS => 4), we only retrieve
4 rows in the output.
COLUMN                        CELL
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546430, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546429, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546428, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546427, value=11171
4 row(s) in 0.0360 seconds
Optionally try the same command again specifying a larger version number. For example,
VERSIONS => 100.
Most likely, this is not the behaviour users expect when performing such a migration.
They probably wanted to get all the data into the HBase table without versioned cells.
There are a couple of solutions for this. One is to define the table with a composite
row key to enforce uniqueness, which will be explored later in this lab. Another option,
outlined in the next section, is to force each row key to be unique by appending a UUID.
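The idea of forcing key uniqueness with a UUID can be sketched as follows (plain Python; the `_` separator and hex encoding are assumptions for illustration, not Big SQL's actual key format):

```python
import uuid

def unique_row_key(order_day_key):
    # Append a random UUID so repeated key values no longer collide
    # (and therefore no longer create cell versions).
    return f"{order_day_key}_{uuid.uuid4().hex}"

k1 = unique_row_key(20070720)
k2 = unique_row_key(20070720)
print(k1 != k2)                   # True: the two rows get distinct keys
print(k1.startswith("20070720"))  # True: prefix (range) scans still work
```

The cost is visible in the sketch too: every row key carries extra UUID bytes, which is why the unique table ends up the largest of the three tables compared later in this lab.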
One-to-one Mapping with UNIQUE Clause
Another option while performing such a migration is to use the force key unique option
when creating the table using BigSQL syntax. This option will force the load to add a UUID to
the row key. It helps to prevent versioning of cells. However, this method is quite inefficient
as it stores more data and also makes queries slower.
Issue the following command in the BigSQL shell. This statement will create the SQL table
with the one-to-one mapping of what we have in our relational DB2 source. This DDL
statement is almost identical to what was seen in the previous section with one exception:
the force key unique clause is specified for the column mapping of the row key.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_UNIQUE
(
  ORDER_DAY_KEY     int,
  ORGANIZATION_KEY  int,
  EMPLOYEE_KEY      int,
  RETAILER_KEY      int,
  RETAILER_SITE_KEY int,
  PRODUCT_KEY       int,
  PROMOTION_KEY     int,
  ORDER_METHOD_KEY  int,
  SALES_ORDER_KEY   int,
  SHIP_DAY_KEY      int,
  CLOSE_DAY_KEY     int,
  QUANTITY          int,
  UNIT_COST         decimal(19,2),
  UNIT_PRICE        decimal(19,2),
  UNIT_SALE_PRICE   decimal(19,2),
  GROSS_MARGIN      double,
  SALE_TOTAL        decimal(19,2),
  GROSS_PROFIT      decimal(19,2)
)
COLUMN MAPPING
(
  key                          mapped by (ORDER_DAY_KEY) force key unique,
  cf_data:cq_ORGANIZATION_KEY  mapped by (ORGANIZATION_KEY),
  cf_data:cq_EMPLOYEE_KEY      mapped by (EMPLOYEE_KEY),
  cf_data:cq_RETAILER_KEY      mapped by (RETAILER_KEY),
  cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
  cf_data:cq_PRODUCT_KEY       mapped by (PRODUCT_KEY),
  cf_data:cq_PROMOTION_KEY     mapped by (PROMOTION_KEY),
  cf_data:cq_ORDER_METHOD_KEY  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_SALES_ORDER_KEY   mapped by (SALES_ORDER_KEY),
  cf_data:cq_SHIP_DAY_KEY      mapped by (SHIP_DAY_KEY),
  cf_data:cq_CLOSE_DAY_KEY     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY          mapped by (QUANTITY),
  cf_data:cq_UNIT_COST         mapped by (UNIT_COST),
  cf_data:cq_UNIT_PRICE        mapped by (UNIT_PRICE),
  cf_data:cq_UNIT_SALE_PRICE   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_GROSS_MARGIN      mapped by (GROSS_MARGIN),
  cf_data:cq_SALE_TOTAL        mapped by (SALE_TOTAL),
  cf_data:cq_GROSS_PROFIT      mapped by (GROSS_PROFIT)
);
In the previous section, we used the load from source command to get the data from
our table on the DB2 source into HBase. This may not always be feasible, which is why in
this section we explore another loading statement, load hbase. This will load data into
HBase using flat files – perhaps an export of the data from the relational source.
Issue the following statement which will load data from a file into an InfoSphere BigInsights
HBase table.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.sls_sales_fact_unique;
Note: The load hbase command can take in an optional list of columns. If no column list is specified, it
will use the column ordering in table definition. The input file can be on DFS or on the local file system
where Big SQL server is running.
Once again, you should expect to load 44603 rows which is the same number of rows that
the select count statement on the original DB2 table verified.
44603 rows affected (total: 26.95s)
Verify the number of rows loaded with a select count statement as shown.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;
This time there is no discrepancy between the results from the load operation and the select
count statement.
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 1.61s; total: 1.61s)
Issue the same count from the HBase shell to be sure.
HBase Shell
count 'gosalesdw.sls_sales_fact_unique'
The counts are consistent across the load, the select, and the HBase count.
...
44603 row(s) in 6.8490 seconds
As in the previous section, from the BigSQL shell, issue the following select query with a
predicate on the order day key.
BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact_unique WHERE
order_day_key = 20070720;
In the previous section, only one row was returned for the specified date. This time, expect to
see 1405 rows since the rows are now forced to be unique due to our clause in the create
statement and therefore no versioning should be applied.
1405 rows in results(first row: 0.47s; total: 0.58s)
Once again, as in the previous section, we can check from the HBase shell if there are
multiple versions of the cells. Issue the following get statement to attempt to retrieve the top
4 versions of the row with row key 20070720.
HBase Shell
get 'gosalesdw.sls_sales_fact_unique', '20070720', {COLUMN =>
'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
Zero rows are returned because a row key of exactly 20070720 doesn't exist. This is due
to the fact that we have appended a UUID to each row key (20070720 + UUID).
COLUMN                        CELL
0 row(s) in 0.0850 seconds
Therefore, instead, issue the following HBase command to do a scan rather than a get. This
will scan the table using the first part of the row key. We also pass scanner
specifications of start and stop row values so that only the results we are interested in
retrieving are returned.
HBase Shell
scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '20070720',
STOPROW => '20070721'}
Notice there are no discrepancies between the results from Big SQL select and HBase scan.
1405 row(s) in 12.1350 seconds
Many-to-one Mapping (Composite Keys and Dense Columns)
This section is dedicated to the other option for enforcing uniqueness of the cells:
defining a table with a composite row key (also known as many-to-one mapping).
In a many-to-one mapping, multiple SQL columns are mapped to a single HBase entity (row
key or a column). There are two terms that may be used frequently: composite key and
dense column. A composite key is an HBase row key that is mapped to multiple SQL
columns. A dense column is an HBase column that is mapped to multiple SQL columns.
In the following example, the row key contains two parts – userid and account number. Each
part corresponds to a SQL column. Similarly, the HBase columns are mapped to multiple
SQL columns. Note that we can have a mix. For example, we can have a composite key, a
dense column, and a non-dense column, or any mix of these.
[Figure: many-to-one mapping example. An HBase row with composite row key 11111_ac11
(SQL columns userid and acc_no) in column family cf_data. The dense column cq_names
holds fname1_lname1, mapped to the SQL columns first_name and last_name; the dense
column cq_acct holds 11111#11#0.25, mapped to the SQL columns balance, min_bal, and
interest.]
Issue the following DDL statement from the BigSQL shell, which represents all entities
from our relational table using a many-to-one mapping. Take notice of the column mapping
section, where multiple columns can be mapped to a single family:qualifier.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE
(
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int,
  RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int,
  SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int,
  QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
  key                      mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
                                      RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY,
                                      PROMOTION_KEY, ORDER_METHOD_KEY),
  cf_data:cq_OTHER_KEYS    mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY      mapped by (QUANTITY),
  cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                      GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
);
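Conceptually, a composite key is just the encoded parts of several SQL columns concatenated with a separator. A minimal sketch (plain Python; the \x00 separator mirrors the x00 bytes visible in later HBase scans in this lab, but Big SQL's exact encoding is not reproduced here):

```python
def composite_key(*parts, sep=b"\x00"):
    # Join the string-encoded parts with a separator byte.
    return sep.join(str(p).encode() for p in parts)

# The eight key columns of SLS_SALES_FACT_DENSE collapse into one row key.
key = composite_key(20070720, 11171, 4428, 7109, 5588, 30265, 5501, 605)
print(key)  # b'20070720\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605'
```

Because the leading part is the order day key, range scans on the key prefix (as used later in this lab) still work against the composite key.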
Why do we need many-to-one mapping?
HBase stores a lot of information for each value. For each value stored, a key consisting
of the row key, column family name, column qualifier, and timestamp is also stored. This
means a lot of duplicate information is kept.
HBase is very verbose, and it is primarily intended for sparse data. In most cases, data
in the relational world is not sparse. If we were to store each SQL column individually
in HBase, as in our previous two sections, the required storage space would grow
dramatically. When querying that data back, the query also returns the entire key
(meaning the row key, column family, and column qualifier) for each value. As an example,
after loading data into this table we will examine the storage space for each of the
three tables created thus far.
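A back-of-the-envelope calculation makes the overhead concrete (plain Python; the byte counts are simplified assumptions, not HBase's exact on-disk format):

```python
# Every HBase cell stores its full coordinates alongside the value:
# row key + column family + column qualifier + timestamp (8 bytes) + value.
def cell_bytes(row_key, family, qualifier, value_len):
    return len(row_key) + len(family) + len(qualifier) + 8 + value_len

row_key, family = "20070720", "cf_data"

# One-to-one: 18 SQL columns become 18 cells, each repeating the coordinates.
one_to_one = 18 * cell_bytes(row_key, family, "cq_SOME_COLUMN", 8)

# Many-to-one: the same data packed into 3 dense cells (8 columns live in the key).
dense = sum(cell_bytes(row_key, family, qualifier, 40)
            for qualifier in ("cq_OTHER_KEYS", "cq_QUANTITY", "cq_DOLLAR_VALUES"))

print(one_to_one, dense)  # 810 229 -- several times smaller per row
```

Even with these toy numbers, the dense layout is a few times smaller per row, which matches the directory-size comparison done below with hadoop fs -du.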
As in the previous section, issue the following statement, which will load data from a file into
the InfoSphere BigInsights HBase table.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.sls_sales_fact_dense;
Notice that the number of rows loaded into a table with many-to-one mapping remains the
same even though we are storing less data! This statement also executes much faster than
the previous load for this exact reason.
44603 rows affected (total: 3.42s)
Issue the same statements and commands from both the BigSQL and HBase shells as in
the previous two sections to verify that the number of rows is the same as in the original
dataset. All of the results should be the same as in the previous section.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 0.93s; total: 0.93s)
BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact_dense WHERE
order_day_key = 20070720;
1405 rows in results(first row: 0.65s; total: 0.68s)
HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW
=> '20070721'}
1405 row(s) in 4.3830 seconds
As noted earlier, one-to-one mapping uses far more storage space than the same data
mapped using composite keys or dense columns, where the HBase row key or HBase
column(s) are made up of multiple relational table columns. This is because HBase
repeats the row key, column family name, column name, and timestamp for each column
value. For relational data, which is usually dense, this causes an explosion in the
required storage space.
Issue the following command as biadmin from a Linux gnome-terminal to check the directory
sizes for the three tables we created thus far.
Linux Terminal
hadoop fs -du /hbase/
…
17731926     hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact
3188         hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_dense
47906322     hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_unique
…
Notice that the dense table is significantly smaller than the others. The table in which we
forced uniqueness is the largest since it needs to append a UUID to each row key.
Data Collation Problem
All data represented thus far has been stored as strings. That is the default encoding on
HBase tables created by BigSQL. Therefore, numeric data is not collated correctly. HBase
uses lexicographic ordering, so you may run into cases where a query returns wrong results.
The following scenario walks through a situation where data is not collated correctly.
Using the Big SQL insert into hbase statement, add the following row to the
sls_sales_fact_dense table we previously defined and loaded. Notice that the date
we are specifying as part of the ORDER_DAY_KEY column (which has data type int) is a
larger numerical value and does not conform to any date standard since it contains an
extra digit.
BigSQL Shell
INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY,
ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY,
PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) VALUES (200707201, 11171,
4428, 7109, 5588, 30265, 5501, 605);
Note: The insert command is available for HBase tables; however, it is not a supported feature.
Issue a scan on the table with the following start and stop criteria.
HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW
=> '20070721'}
Take notice of the last three rows/cells returned from the output of this scan. The newly
added row shows up in the scan even though its integer value is not between 20070720 and
20070721.
200707201\x0011171\x004428\x007109\x005588\x0030264\x005501\x00605
 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value=…
200707201\x0011171\x004428\x007109\x005588\x0030264\x005501\x00605
 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value=…
200707201\x0011171\x004428\x007109\x005588\x0030264\x005501\x00605
 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value=…
1406 row(s) in 4.2400 seconds
Now insert another row into the table with the following command. This time we are
conforming to the date format of YYYYMMDD and incrementing the day by 1 from the last
value returned in the table; i.e., 20070721.
BigSQL Shell
INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY,
ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY,
PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) VALUES (20070721, 11171,
4428, 7109, 5588, 30265, 5501, 605);
Issue another scan on the table, remembering to increase the STOPROW criterion by one day.
HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW
=> '20070722'}
Now notice that the newly added row is included in the result set, and that the row
with ORDER_DAY_KEY 200707201 comes after the row with ORDER_DAY_KEY 20070721.
This is an example of numeric data not being collated properly: the rows are stored
in byte-lexicographic order rather than the numerical order one might expect.
200707201\x0011171\x004428\x007109\x005588\x0030264\x005501\x00605
 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value=…
200707201\x0011171\x004428\x007109\x005588\x0030264\x005501\x00605
 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value=…
200707201\x0011171\x004428\x007109\x005588\x0030264\x005501\x00605
 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value=…
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605
 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692480966, value=…
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605
 column=cf_data:cq_OTHER_KEYS, timestamp=1376692480966, value=…
20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605
 column=cf_data:cq_QUANTITY, timestamp=1376692480966, value=…
1407 row(s) in 2.8840 seconds
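The misplacement seen in the scan above is easy to reproduce outside HBase; any byte-wise sort of numeric strings behaves the same way:

```python
# String row keys sort byte-by-byte, so '200707201' falls between
# '20070720' and '20070721' even though its numeric value is far larger.
keys = ["20070721", "200707201", "20070720"]

print(sorted(keys))           # lexicographic, as HBase stores string keys
print(sorted(keys, key=int))  # true numeric order
```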
Many-to-one Mapping with Binary Encoding
20
21. Big SQL supports two types of data encodings: string and binary. Each HBase entity can
also have its own encoding. For example, a row key can be encoded as a string, one HBase
column can be encoded as binary and another as string.
String is the default encoding used in Big SQL HBase tables. The value is converted to string
and stored as UTF-8 bytes. When multiple parts are packed into one HBase entity,
separators are used to delimit data. The default separator is the null byte. As it is the lowest
byte, it maintains data collation and allows range queries and partial row scans to work
correctly.
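Why the null byte specifically? Because it is lower than any data byte, a packed key can never sort past another key that shares a shorter prefix. A quick sketch (the packing here is an illustration, not Big SQL's exact wire format):

```python
# Join key parts with a separator and compare the resulting byte strings.
def pack(parts, sep):
    return sep.join(p.encode("utf-8") for p in parts)

# With the null byte, the key for ("ab", "9") sorts before ("abc", "1"),
# keeping every key that starts with "ab" contiguous for a range scan.
low_sep = sorted([pack(("ab", "9"), b"\x00"), pack(("abc", "1"), b"\x00")])

# With a high-valued separator like "~", the order flips and breaks
# partial-key grouping: "abc~1" now sorts before "ab~9".
high_sep = sorted([pack(("ab", "9"), b"~"), pack(("abc", "1"), b"~")])

print(low_sep, high_sep)
```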
Binary encoding in Big SQL is sortable, so numeric data, including negative numbers,
collates properly. It handles separators internally and avoids the problem of
separators appearing within the data by escaping them.
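The collation property of a sortable binary encoding can be illustrated with a simple offset-binary integer encoding (an analogy only; Big SQL's actual binary format is its own implementation):

```python
import struct

def enc(n):
    # Offset-binary (excess-2^31): adding 2^31 maps the signed range onto
    # unsigned big-endian bytes whose byte order equals numeric order.
    return struct.pack(">I", n + 0x80000000)

nums = [10, -3, 2]

# Sorting by the encoded bytes gives correct numeric order, negatives included.
print(sorted(nums, key=enc))

# Sorting the string forms does not: "10" < "2" byte-wise.
print(sorted(str(n) for n in nums))
```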
Issue the following DDL statement from the BigSQL shell to create a dense table as we did
in the previous section, but this time overriding the default encoding to binary.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE_BINARY
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY
int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int,
ORDER_METHOD_KEY int,
SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int,
QUANTITY int,
UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE
decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2),
GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
key
mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY,
ORDER_METHOD_KEY),
cf_data:cq_OTHER_KEYS
mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY,
CLOSE_DAY_KEY),
cf_data:cq_QUANTITY
mapped by (QUANTITY),
cf_data:cq_DOLLAR_VALUES
mapped by (UNIT_COST, UNIT_PRICE,
UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
)
default encoding binary;
Once again, use the load hbase data command to load the data into the table. This time
we are adding the DISABLE WAL clause. By using the option to disable WAL (write-ahead
log), writes into HBase can be sped up. However, this is not a safe option. Turning off WAL
can result in data loss if a region server crashes. Another possible option to speed up load is
to increase the write buffer size.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.sls_sales_fact_dense_binary DISABLE WAL;
44603 rows affected (total: 5.54s)
Issue a select statement on the newly created and loaded table with binary encoding,
sls_sales_fact_dense_binary.
BigSQL Shell
SELECT * FROM gosalesdw.sls_sales_fact_dense_binary
go –m discard;
Note: The “go –m discard” option is used so that the results of the command will not be displayed in
the terminal.
44603 rows in results(first row: 0.35s; total: 2.89s)
Issue another select statement on the previous table that has string encoding,
sls_sales_fact_dense.
BigSQL Shell
SELECT * FROM gosalesdw.sls_sales_fact_dense
go –m discard;
44605 rows in results(first row: 0.31s; total: 3.1s)
The main point to see here is that the query against the binary-encoded table can
return faster. (Numeric types are also collated properly.)
Note: You will probably not see much, if any, performance differences in this lab exercise since we are
working with such a small dataset.
No custom serialization/deserialization logic is required for string encoding. This
makes it portable if you want another application to read the data in HBase tables.
A main use case for string encoding is mapping existing data: delimited data is a
very common storage format and can be easily mapped using Big SQL string encoding.
However, parsing strings is expensive, so queries over string-encoded data are slow.
Also, numeric data is not collated correctly, as we have seen.
Queries on data encoded as binary have faster response times. Numeric data, including
negative numbers, is also collated correctly with binary encoding. The downside is
that the data is encoded by Big SQL logic and may not be portable as-is.
Many-to-one Mapping with HBase Pre-created Regions and
External Tables
HBase automatically splits regions when they reach a set size limit. In some
scenarios, like bulk loading, it is more efficient to pre-create regions so that the
load operation can take place in parallel. The sales data covers 4 months, April
through July of 2007, so we can pre-create regions by specifying splits in the create
table command.
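The four split points used in the create statement below simply mark the month boundaries of the data; since each split key is a row-key prefix (YYYYMM), they can be generated mechanically:

```python
# Generate the region split keys for April through July 2007. Each split key
# is a row-key prefix (YYYYMM), so rows for each month land in their own region.
splits = ["2007%02d" % month for month in range(4, 8)]
print(splits)
```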
In this section, we will create a table within the HBase shell with pre-defined splits,
not using any Big SQL features at first. Then we will showcase how users can map
existing data in HBase to Big SQL, which can prove to be a very common practice. This
is made possible by creating what are called external tables.
Start by issuing the following statement in the HBase shell. This will create the
sls_sales_fact_dense_split table with pre-defined region splits for April through July in 2007.
HBase Shell
create 'gosalesdw.sls_sales_fact_dense_split', {NAME => 'cf_data',
REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION =>
'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS =>
'0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER
=> 'NONE', TTL => '2147483647', VERSIONS => '2147483647', BLOCKSIZE =>
'65536'}, {SPLITS => ['200704', '200705', '200706', '200707']}
Issue the following list command on the HBase shell to verify the newly created table.
HBase Shell
list
Note that if we were to list the tables from the Big SQL shell, we would not see this table
because we have not made any association yet to Big SQL.
Point a browser to the following URL: http://bivm:60010/. Scroll down and click on
the table we just defined in the HBase shell, gosalesdw.sls_sales_fact_dense_split.
Examine the pre-created regions for this table, as we defined them when creating the
table.
Execute the following create external hbase command to map the existing table we have
just created in HBase to Big SQL. Some things to note about the command:
- The create table statement allows specifying a different name for the SQL table
through the hbase table name clause. Using external tables, you can also create
multiple views of the same HBase table. For example, one table can map to a few
columns and another table to another set of columns.
- The column mapping section of the create table statement allows specifying a
different separator for each column and for the row key.
- External tables can also be used to map tables created using the Hive HBase
storage handler. These cannot be directly read using the Big SQL storage handler.
BigSQL Shell
CREATE EXTERNAL HBASE TABLE
GOSALESDW.EXTERNAL_SLS_SALES_FACT_DENSE_SPLIT
(
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY
int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int,
ORDER_METHOD_KEY int,
SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int,
QUANTITY int,
UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE
decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2),
GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
key
mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY,
ORDER_METHOD_KEY) SEPARATOR '-',
cf_data:cq_OTHER_KEYS
mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY,
CLOSE_DAY_KEY) SEPARATOR '/',
cf_data:cq_QUANTITY
mapped by (QUANTITY),
cf_data:cq_DOLLAR_VALUES
mapped by (UNIT_COST, UNIT_PRICE,
UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) SEPARATOR '|'
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';
The data in external tables is not validated at creation time. For example, if the
separators defined for a column do not match the data it contains, the query results
will be unpredictable.
Note: External tables are not owned by Big SQL and hence cannot be dropped via Big SQL. Also,
secondary indexes cannot be created via Big SQL on external tables.
Use the following command to load the external table we have defined.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.external_sls_sales_fact_dense_split;
44603 rows affected (total: 1m57.2s)
Verify that the number of rows loaded matches the number of rows returned by querying
the external SQL table.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.external_sls_sales_fact_dense_split;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 6.44s; total: 6.46s)
Verify the same from the HBase shell directly on the underlying HBase table.
HBase Shell
count 'gosalesdw.sls_sales_fact_dense_split'
...
44603 row(s) in 9.1620 seconds
Issue a get command from the HBase shell specifying the row key as follows. Notice the
separator between each part of the row key is a “-”, which is what we defined when
originally creating the external table.
HBase Shell
get 'gosalesdw.sls_sales_fact_dense_split', '20070720-11171-4428-7109-5588-30263-5501-605'
In the following output you can also see the other separators we defined for the
external table: “|” for cq_DOLLAR_VALUES and “/” for cq_OTHER_KEYS.
COLUMN                     CELL
 cf_data:cq_DOLLAR_VALUES  timestamp=1376690502630, value=33.59|62.65|62.65|0.4638|1566.25|726.50
 cf_data:cq_OTHER_KEYS     timestamp=1376690502630, value=481896/20070723/20070723
 cf_data:cq_QUANTITY       timestamp=1376690502630, value=25
3 row(s) in 0.0610 seconds
Of course, in Big SQL we don't need to specify separators such as “-” when querying
the table, as in the command below.
BigSQL Shell
SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split WHERE
ORDER_DAY_KEY = 20070720 AND ORGANIZATION_KEY = 11171 AND EMPLOYEE_KEY
= 4428 AND RETAILER_KEY = 7109 AND RETAILER_SITE_KEY = 5588 AND
PRODUCT_KEY = 30263 AND PROMOTION_KEY = 5501 AND ORDER_METHOD_KEY =
605;
Load Data: Error Handling
In this final section of the part of the lab, we will examine how to handle errors during the
load operation.
The load hbase command has an option to continue past errors. The LOG ERROR ROWS
IN FILE clause can be used to specify a file name to log any rows that could not be
loaded because of errors. Some of the common errors are invalid numeric types and,
for string encoding, a separator appearing within the data.
Linux Terminal
hadoop fs -cat /user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt
2007072a	11171	…
b0070720	11171	…
2007-07-20	11171	…
20070720	11-71	…
20070721	11171	…
Note that a separator appearing within the data is an issue with string encoding.
Knowing there are errors with the input data, proceed to issue the following load command,
specifying a directory and file where to put the “bad” rows.
BigSQL Shell
LOAD HBASE DATA INPATH
'/user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt' DELIMITED FIELDS
TERMINATED BY '\t' INTO TABLE
gosalesdw.external_sls_sales_fact_dense_split LOG ERROR ROWS IN FILE
'/tmp/SLS_SALES_FACT_load.err';
In this example, 4 rows did not get loaded because of errors. Note that load reports
only the rows that passed through it successfully.
1 row affected (total: 2.74s)
Examine the file specified in the load command to view the rows which were not loaded.
Linux Terminal
hadoop fs -cat /tmp/SLS_SALES_FACT_load.err
"2007072a","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"b0070720","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"2007-07-20","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"20070720","11-71","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
[OPTIONAL] HBase Access via JAQL
Jaql has an HBase module that can be used to create HBase tables, insert data into
them, and query them efficiently using multiple modes: a local mode that directly
accesses HBase, as well as a map reduce mode. It allows specifying query optimization
options similar to what is available in the hbase shell. The capability to
transparently use map reduce jobs makes it work well with bigger tables. At the same
time, users can force local mode when they run point or range queries. It allows use
of a SQL language subset termed Jaql SQL, which provides the capability to join,
group, and perform other aggregations on tables. It also provides access to data from
different sources, such as relational DBMSs, and different formats, like delimited
files, Avro, and anything else supported by Jaql. The results of a query can be
written in different formats to HDFS and read by other BigInsights applications, like
BigSheets, for further analysis. In this section, we'll first pull information from
our relational DBMS and then go over the Jaql HBase module, specifically the
additional features it provides.
Start by opening a Jaql shell. You can open the same (JSQSH) terminal that was used
for Big SQL by adding the “--jaql” option as shown below. This is a much better
environment to work in than the standard Jaql shell, as it provides features like
command history via the up arrow key, and you can also traverse through your commands
using the left/right arrow keys.
Linux Terminal
/opt/ibm/biginsights/bigsql/jsqsh/bin/jsqsh --jaql;
Once in the JSQSH shell with Jaql option, load the dbms::jdbc driver with the following
command.
BigSQL/JAQL Shell
import dbms::jdbc;
Add the JDBC driver JAR file to the classpath.
BigSQL/JAQL Shell
addRelativeClassPath(getSystemSearchPath(),
'/opt/ibm/db2/V10.5/java/db2jcc.jar');
Supply the connection information.
BigSQL/JAQL Shell
db := jdbc::connect(
driver = 'com.ibm.db2.jcc.DB2Driver',
url = 'jdbc:db2://localhost:50000/gosales',
properties = {user: "db2inst1", password: "password"} );
Specify the rows to be retrieved with a SQL select statement.
BigSQL/JAQL Shell
DESC := jdbc::prepare( db, query =
"SELECT * FROM db2inst1.sls_sales_fact_10p");
In many-to-one mapping for row key, we went over creation of a composite key. In the next
few steps, we will use Jaql to load the same data using a composite key and dense columns.
We’ll pack all columns that make up primary key of the relational table into a HBase row key,
and we’ll also pack other columns into dense HBase columns.
Define a variable to read the original data from the relational JDBC source. This converts
each tuple of the table into a JSON record.
BigSQL/JAQL Shell
ssf = localRead(DESC);
Transform the record into the required format. Essentially we are doing the same procedure
as when we defined the many-to-one mapping in the previous sections. For the first element,
which we will use for HBase row key, concatenate the values of the columns that form the
primary key of the sales fact table using a “-” separator. For the remaining columns, pack
them into other dense HBase columns: cq_OTHER_KEYS (using “/” separator),
cq_QUANTITY, and cq_DOLLAR_VALUES (using “|” separator).
BigSQL/JAQL Shell
ssft = ssf -> transform [$."ORDER_DAY_KEY", $."ORGANIZATION_KEY",
$."EMPLOYEE_KEY", $."RETAILER_KEY", $."RETAILER_SITE_KEY",
$."PRODUCT_KEY", $."PROMOTION_KEY", $."ORDER_METHOD_KEY",
$."SALES_ORDER_KEY", $."SHIP_DAY_KEY", $."CLOSE_DAY_KEY", $."QUANTITY",
$."UNIT_COST", $."UNIT_PRICE", $."UNIT_SALE_PRICE", $."GROSS_MARGIN",
$."SALE_TOTAL", $."GROSS_PROFIT"] -> transform
{
key: strcat($[0],"-",$[1],"-",$[2],"-",$[3],"-",$[4],"-",$[5],"-",$[6],"-",$[7]),
cf_data: {
cq_OTHER_KEYS: strcat($[8],"/",$[9],"/",$[10]),
cq_QUANTITY: strcat($[11]),
cq_DOLLAR_VALUES:
strcat($[12],"|",$[13],"|",$[14],"|",$[15],"|",$[16],"|",$[17])
}
};
Verify the data is in the correct format by querying the first record.
BigSQL/JAQL Shell
ssft -> top 1;
{
"key": "20070418-11114-4415-7314-5794-30124-5501-605",
"cf_data": {
"cq_OTHER_KEYS": "254121/20070423/20070423",
"cq_QUANTITY": "60",
"cq_DOLLAR_VALUES": "610.00m|1359.72m|1291.73m|0.5278|77503.80m|40903.80m"
}
}
(1 row in 2.40s)
Now we have the data ready to be written into HBase. First import the hbase module which
prepares jaql by loading required jars and preparing the environment using the HBase
configuration files.
BigSQL/JAQL Shell
import hbase(*);
Use hbaseString to define a schema for the HBase table. The HBase table does not get
created until something is written into it. An array of records that match the specified schema
should be used to write into the HBase table. The data types correspond to how Jaql will
interpret the data.
BigSQL/JAQL Shell
SSFHT = hbaseString('sales_fact2', schema { key: string, cf_data?: {*:
string}}, create=true, replace=true, rowBatchSize=10000,
colBatchSize=200 );
Note: As this could be a big table, specify rowBatchSize and colBatchSize, which will be used for
scanner caching and the column batch size by the internal HBase scan object. The column batch size is
useful when rows have a huge number of columns.
Write to the table using the previously created ssft array which matches the specified
schema.
BigSQL/JAQL Shell
ssft -> write(SSFHT);
A write operation will create the HBase table, and populate it with the input data. To confirm,
use hbase shell to count (or scan) the table and verify the data was written with the right
number of rows.
HBase Shell
count 'sales_fact2'
44603 row(s) in 3.6230 seconds
To read the contents of the HBase table using Jaql, use read on the hbaseString. In the
following command we are also passing the read directly into a count function to verify the
right number of rows.
BigSQL/JAQL Shell
count(read(SSFHT));
44603
To query for rows matching a particular order day key 20070720, use setKeyRange for the
partial range query. Use localRead for point and range queries as Jaql is tuned for local
execution and performs efficiently.
BigSQL/JAQL Shell
localRead(SSFHT -> setKeyRange('20070720', '20070721'));
Perform the same query using the HBase shell. Both complete in a similar amount of time.
HBase Shell
scan 'sales_fact2', {STARTROW => '20070720', STOPROW => '20070721',
CACHE => 10000}
To query for a row when we have the values for all primary key columns, we can construct
the entire row key and perform a point query.
BigSQL/JAQL Shell
localRead(SSFHT -> setKey('20070720-11171-4428-7109-5588-30263-5501-605'));
Equivalently, this is what the statement would look like from the HBase shell.
HBase Shell
get 'sales_fact2', '20070720-11171-4428-7109-5588-30263-5501-605'
To use a filter from Jaql, use the setFilter function along with addFilter. In the
case below, the predicate is on the sales order key, which is the leading part of the
cq_OTHER_KEYS dense column and hence can be used in a prefix predicate.
BigSQL/JAQL Shell
read(SSFHT -> setFilter([addFilter(filterType.SingleColumnValueFilter,
HBaseKeyArrayToBinary(["481896/"]),
compareOp.equal,
comparators.BinaryPrefixComparator,
"cf_data",
"cq_OTHER_KEYS",
true
)
])
);
PART II – A – Query Handling
Efficiently querying HBase requires pushing as much to the server(s) as possible. This
includes projection pushdown or fetching the minimal set of columns that are required by the
query. It also includes pushing down query predicates into the server as scan limits, filters,
index lookups, etc. Setting scan limits is extremely powerful as it can help to narrow down
regions we need to scan. With a full row key, HBase can quickly pinpoint the region and the
row. With partial keys and key ranges (upper, lower limits or both), HBase can narrow down
regions or eliminate regions which fall outside the range.
Indexes help to leverage this key lookup, but they use two tables to achieve it.
Filters cannot eliminate regions, but some have the capability to skip within a
region; they help narrow down the data set returned to the client.
With limited metadata/statistics about HBase tables, supporting a variety of hints helps
improve query efficiency.
The Data
This section describes the schema which the sample data will use to demonstrate the effects
of pushdown from Big SQL.
We will use the TPC-H orders table with 150,000 rows, defined using the mapping shown
below.
Issue the following command from a Big SQL shell to create the orders table. Notice this
table has a many-to-one mapping, meaning there is a composite key and dense columns.
BigSQL Shell
CREATE HBASE TABLE ORDERS
(
O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1),
O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15),
O_CLERK VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79)
)
column mapping
(
key
mapped by (O_CUSTKEY,O_ORDERKEY),
cf:d mapped by
(O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_CO
MMENT),
cf:od mapped by (O_ORDERDATE)
)
default encoding binary;
Load the sample data into the newly created table by issuing the following command.
Note: As in Part I, there are three sample sets provided for you. Each one is essentially the same; the
key difference is the amount of data contained within them. The remaining instructions in this lab
exercise will use the orders.10p.tbl dataset simply because it has a smaller amount of data and will be
faster to work with for demonstration purposes. If you would like to use the larger tables with more
data, feel free to do so, but remember to change the names appropriately.
BigSQL Shell
LOAD HBASE DATA INPATH 'tpch/orders.10p.tbl' DELIMITED FIELDS
TERMINATED BY '|' INTO TABLE ORDERS;
150000 rows affected (total: 21.52s)
In the next set of sections, we examine the output from the Big SQL log files to point
out what you can check to confirm pushdown from Big SQL. To view log messages, you may
first have to change logging levels using the commands below.
BigSQL Shell
log com.ibm.jaql.modules.hcat.mapred.JaqlHBaseInputFormat info;
BigSQL Shell
log com.ibm.jaql.modules.hcat.hbase info;
Note that columns are pushed down at the HBase column level. So in many-to-one
mappings, if the query requires only one part of a dense column with many parts, the
entire value for the dense column will be returned. Therefore, it is efficient to pack
together columns that are usually queried together.
Use the following command to tail the Big SQL log file. Keep this open in a terminal
throughout this entire part of this lab. We will be referring to it quite often to see what is
going on behind the scenes when running certain commands.
Linux Terminal
tail -f /var/ibm/biginsights/bigsql/logs/bigsql.log
Projection Pushdown
The first query here does a SELECT * and requests all HBase columns used in the table
mapping. The original HBase table could have a lot more columns; we may have defined an
external table mapping to just a few columns. In such cases, only the HBase columns used
in mapping will be retrieved.
BigSQL Shell
SELECT * FROM orders
go -m discard;
150000 rows in results(first row: 1.73s; total: 10.69s)
In the Big SQL log file, we can see that we returned data from both columns.
BigSQL Log
…
…HBase scan details:{…, families={cf=[d, od]}, …, stopRow=, startRow=,
totalColumns=2, …}
This second query requests only one HBase column:
BigSQL Shell
SELECT o_totalprice FROM orders
go -m discard;
Notice that the query returns much faster since we are returning much less data.
150000 rows in results(first row: 0.27s; total: 2.83s)
Verify from the log file that this query only executed against one column.
BigSQL Log
…
…HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, totalColumns=1,
…}
The third query also requests only one HBase column.
BigSQL Shell
SELECT o_orderdate FROM orders
go -m discard;
Although this query actually returns less data, it has a higher response time because
serialization/deserialization of the timestamp type is expensive.
150000 rows in results(first row: 0.37s; total: 4.5s)
BigSQL Log
…
…HBase scan details:{…, families={cf=[od]}, …, stopRow=, startRow=, totalColumns=1,
…}
Predicate Pushdown
Point Scan
Identifying and using point scans is the most effective optimization for queries into HBase.
For converting to point scan, we need to get the predicate value covering the full row key.
This could come in as multiple predicates as Big SQL supports composite keys.
The query analyzer in Big SQL is capable of combining multiple predicates to identify
a full row scan. Currently, this analysis happens at run time in the storage handler,
at which point the decision of whether or not to use map reduce has already been made.
To bypass map reduce, a user currently has to provide explicit local mode access hints.
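The predicate-combining idea can be pictured like this (a sketch of the concept only; the actual storage-handler logic works on Big SQL's internal representation and encoded keys):

```python
# Combine equality predicates into a full row key when every composite key
# part is covered; otherwise only a leading-prefix range scan is possible.
def analyze(predicates, key_parts):
    leading = []
    for part in key_parts:
        if part not in predicates:
            break
        leading.append(str(predicates[part]))
    if len(leading) == len(key_parts):
        return ("point", "-".join(leading))
    return ("range", "-".join(leading))

print(analyze({"o_custkey": 4, "o_orderkey": 5612065},
              ["o_custkey", "o_orderkey"]))
print(analyze({"o_custkey": 4}, ["o_custkey", "o_orderkey"]))
```

With both key parts supplied, the analysis yields a point scan; with only the leading part, a range scan, which is exactly the split between this section and the next.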
In the example below, the command “set force local on” makes sure all queries
executing in the session do not use map reduce.
BigSQL Shell
set force local on;
Issue the following select statement, which provides predicates for the columns that
comprise the full row key: custkey and orderkey.
BigSQL Shell
select o_orderkey, o_totalprice from orders where o_custkey=4 and
o_orderkey=5612065;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5612065 |  71845.25781 |
+------------+--------------+
1 row in results(first row: 0.18s; total: 0.18s)
If we check the logs, you can see that Big SQL successfully took both predicates specified
and combined them to do a row scan using all parts of the composite key.
BigSQL Log
…
… Found a row scan by combining all composite key parts.
… Found a row scan from row key parts
… HBase filter list created using AND.
… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):
[PrefixFilter \x01\x80\x00\x00\x04], …,
stopRow=\x01\x80\x00\x00\x04\x01\x80\x00\x00\x00\x00U\xA2!,
startRow=\x01\x80\x00\x00\x04\x01\x80\x00\x00\x00\x00U\xA2!, totalColumns=1, …}
Partial Row Scan
This section shows the capability of the Big SQL server to process predicates on
leading parts of the row key, and not necessarily the full row key as in the previous
section.
Issue the following example query that provides a predicate for the first part of the row key,
custkey.
BigSQL Shell
select o_orderkey, o_totalprice from orders where o_custkey=4;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5453440 |  17938.41016 |
|    5612065 |  71845.25781 |
+------------+--------------+
2 rows in results(first row: 0.19s; total: 0.19s)
Checking the logs, you can see the predicate on the first part of the row key is
converted to a range scan. The stop row in the scan is non-inclusive, so it is
internally appended with the highest possible byte to cover the partial range.
BigSQL Log
…
… Found a row scan that uses the first 1 part(s) of composite key.
… Found a row scan from row key parts
… HBase filter list created using AND.
… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):
[PrefixFilter \x01\x80\x00\x00\x04], …, stopRow=\x01\x80\x00\x00\x04\xFF,
startRow=\x01\x80\x00\x00\x01, totalColumns=1, …}
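The stop-row construction the log shows can be sketched directly (an illustration of the padding idea, not the exact Big SQL byte logic):

```python
# A predicate on the leading key part becomes a range scan: the start row is
# the encoded prefix and the (exclusive) stop row is the prefix padded with
# the highest byte, so every key beginning with the prefix falls in range.
def prefix_range(prefix: bytes):
    return prefix, prefix + b"\xff"

start, stop = prefix_range(b"\x01\x80\x00\x00\x04")
full_key = b"\x01\x80\x00\x00\x04\x01\x80\x00\x00\x00\x00\x55\xa2\x21"
print(start <= full_key < stop)
```

Any full composite key sharing the prefix sorts between the two bounds, which is how regions outside the range get eliminated.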
Range Scan
When there are range predicates, we can set the start or stop row or both.
In our example query below we have a ‘less than’ predicate; therefore we only know the stop
row. However, even setting this will help eliminate regions with row keys that fall above the
stop row. Issue the following command.
BigSQL Shell
select o_orderkey, o_totalprice from orders where o_custkey < 15;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5453440 |  17938.41016 |
|    5612065 |  71845.25781 |
|    5805349 | 255145.51562 |
|    5987111 |  97765.57812 |
|    5692738 | 143292.53125 |
|    5885190 | 125285.42969 |
|    5693440 | 117319.15625 |
|    5880160 | 198773.68750 |
|    5414466 | 149205.60938 |
|    5534435 | 136184.51562 |
|    5566567 |  56285.71094 |
+------------+--------------+
11 rows in results(first row: 0.22s; total: 0.22s)
Notice in the log file that, similarly to the previous section, we are only using the
first part of the composite key since we are specifying custkey in the predicate.
However, in this case, since we only know the stop row (less than 15), there is no
value for the start row portion of the scan.
BigSQL Log
…
… Found a row scan that uses the first 1 part(s) of composite key.
… Found a row scan from row key parts
…
… HBase scan details:{…, families={cf=[d]}, …, stopRow=\x01\x80\x00\x00\x0F,
startRow=, totalColumns=1, …}
Full Table Scan
This section simply shows an example of what happens when none of the predicates can
be pushed down to HBase.
In this example query, the predicate (orderkey) is on a non-leading part of the row
key and therefore is not pushed down. Issue the command to see that this results in a
full table scan.
BigSQL Shell
select o_orderkey, o_totalprice from orders where o_orderkey=5612065;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5612065 |  71845.25781 |
+------------+--------------+
1 row in results(first row: 1.90s; total: 1.90s)
As can be determined by examining the logs, in cases where none of the predicates can be
pushed down to HBase, a full table scan is required, meaning there are no specified values
for either the start or stop row.
BigSQL Log
…
… HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, …}
Automatic Index Usage
This section will demonstrate the benefits of an index lookup.
Before creating an index, let’s first execute a query that will invoke a full table scan so we
can do a comparison later to see the performance benefits we can achieve by creating an
index on particular column(s). Notice we are specifying a predicate on the clerk column,
which is the middle part of the defined dense column.
BigSQL Shell
SELECT * FROM orders WHERE o_clerk='Clerk#000000999'
go -m discard;
154 rows in results(first row: 2.40s; total: 4.32s)
As you can see below in the log file, there is no usage of an index.
BigSQL Log
…
… indexScanInfo: [isIndexScan: false], valuesInfo: [minValue: undefined,
minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo:
[numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false],
indexScanCandidateInfo: [hasIndexScanCandidate: false]]
…
Issue the following command to create an index on the clerk column, which is the middle part
of a dense column in the table. This creates a new table to store the index data. The index
table stores each column value and the row key it appears in.
BigSQL Shell
CREATE INDEX ix_clerk ON TABLE orders (o_clerk) AS 'hbase';
Note:
The create index statement creates a new index table, named
<base_table_name>_<index_name>, deploys the coprocessor, and populates the index table using
the MapReduce index builder. The "as 'hbase'" clause indicates the type of index handler to use; for
HBase, there is a separate index handler.
0 rows affected (total: 1m17.47s)
Re-issue the exact same command as we did earlier.
BigSQL Shell
SELECT * FROM orders WHERE o_clerk='Clerk#000000999'
go -m discard;
After creating the index and issuing the same select statement, Big SQL will automatically
take advantage of the index that was created and avoid a full table scan, which results in a
much faster response time.
154 rows in results(first row: 0.73s; total: 0.74s)
You can verify in the log file that Big SQL used the index. In this case the index table is scanned for all
matching rows that start with value of predicate clerk, in this case Clerk#000000999. From
the matching row(s), the row key(s) of base table are extracted and get requests are batched
and sent to the data table.
BigSQL Log
…
… indexScanInfo: [isIndexScan: true, keyLookupType: point_query, indexDetails:
JaqlHBaseIndex[indexName: ix_clerk, indexSpec: {"bin_terminator": "#","columns":
[{"cf": "cf","col": "o_clerk","cq": "d","from_dense": "true"}],"comp_seperator":
"%","composite": "false","key_seperator": "/","name": "ix_clerk"}, numColumns: 1,
columns: [Ljava.lang.String;@3ced3ced, startValue: x01Clerk#000000999x00,
stopValue: x01Clerk#000000999x00]], valuesInfo: [minValue: [B@4b834b83,
minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo:
[numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false],
indexScanCandidateInfo: [hasIndexScanCandidate: true, indexScanCandidate:
IndexScanCandidate[columnName: o_clerk,indexColValue: [B@4cda4cda,[operator:
=,isVariableLength: false,type: null,encoding: BINARY]]]
… Found an index scan from index scan candidates. Details:
… Index name: ix_clerk
…
… Index query details: [indexSpec:ix_clerk, startValueBytes: #Clerk#000000999,
stopValueBytes: #Clerk#000000999,baseTableScanStart:,baseTableScanStop:]
… Index query successful.
Note: For a composite index, where multiple columns are used to define an index, predicates are handled
and pushed down similarly to what is done for composite row keys.
Without the index, the predicate could not be pushed down, as it is on a non-leading part of
a dense column. In such cases, a full table scan is required, as seen at the beginning of this
section.
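The index lookup described above can be modeled with a toy sketch (this is an illustration of the mechanism, not Big SQL's implementation; the table contents and row-key names are invented for the example): the index table maps each indexed column value to the base-table row keys it appears in, and the matching rows are then fetched from the data table with batched gets.

```python
# Toy model of a secondary index on o_clerk: value -> base-table row keys.
# All keys and values below are hypothetical illustration data.
index_table = {
    "Clerk#000000999": ["rowkey-17", "rowkey-42"],
    "Clerk#000000001": ["rowkey-3"],
}
base_table = {
    "rowkey-3":  {"o_clerk": "Clerk#000000001", "o_totalprice": 1000.00},
    "rowkey-17": {"o_clerk": "Clerk#000000999", "o_totalprice": 17938.41},
    "rowkey-42": {"o_clerk": "Clerk#000000999", "o_totalprice": 71845.26},
}

def index_lookup(value):
    """Scan the index table for the predicate value, extract the
    matching row keys, then fetch those rows from the base (data)
    table -- in real HBase this last step is a batched get."""
    row_keys = index_table.get(value, [])
    return [base_table[k] for k in row_keys]

rows = index_lookup("Clerk#000000999")
print(len(rows))  # 2
```

This is why the indexed query avoids the full scan: only the index table is scanned, and only the matching base rows are read.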
Pushing Down Filters into HBase
Though HBase filters do not avoid a full table scan, they limit the rows and data returned to
the client. HBase filters have a skip facility that lets them skip over certain portions of data.
Many of the built-in filters implement this and thus prove more efficient than a raw table scan.
There are also filters that can limit the data within a row. For example, when we only need the
row key portion of the data, filters like FirstKeyOnlyFilter and
KeyOnlyFilter can be applied to return a single instance of the row key.
The sample query below will demonstrate a case where Big SQL pushes down a row scan
and a column filter.
BigSQL Shell
SELECT o_orderkey FROM orders WHERE o_custkey>100000 AND
o_orderstatus='P'
go -m discard;
1278 rows in results(first row: 0.37s; total: 0.38s)
Notice, the predicate on the custkey column triggers the row scan. The column filter,
SingleColumnValueFilter, is triggered because there is a predicate on the leading part
of a dense column (cf:d).
BigSQL Log
…
… Found a row scan that uses the first 1 part(s) of composite key.
… Found a row scan from row key parts
… HBase filter list created using AND.
…
… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):
[SingleColumnValueFilter (cf, d, EQUAL, x01Px00)], …, stopRow=,
startRow=x01x80x01x86xA1, totalColumns=1, …}
This way Big SQL can automatically convert predicates into many of these filters and thus
handle queries more efficiently.
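What a SingleColumnValueFilter achieves can be sketched with a toy model (not HBase code; the rows and dense-column layout below are invented for illustration): rows are still scanned on the server side, but only rows whose dense-column value matches the predicate are returned to the client.

```python
# Hypothetical scanned rows: each carries a dense column cf:d whose
# leading part is o_orderstatus (illustration data only).
scanned_rows = [
    {"row_key": b"\x01\x80\x01\x86\xa2", "cf:d": ("P", 17938.41)},
    {"row_key": b"\x01\x80\x01\x86\xa3", "cf:d": ("F", 71845.26)},
    {"row_key": b"\x01\x80\x01\x86\xa4", "cf:d": ("P", 255145.52)},
]

def single_column_value_filter(rows, column, predicate):
    """Keep only rows where predicate(column value) holds. In HBase
    this filtering happens in the region server, so non-matching rows
    never cross the wire to the client."""
    return [r for r in rows if predicate(r[column])]

matched = single_column_value_filter(
    scanned_rows, "cf:d", lambda v: v[0] == "P")  # leading part == 'P'
print(len(matched))  # 2
```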
Table Access Hints
Access hints affect the strategy that is used to read the table, identify the source of the data,
and how to optimize a query. For example, the strategy can affect such behaviour as
whether MapReduce is employed to implement the join or whether a memory (hash) join is
employed. These hints can also control how to access data from specific sources. The table
access hint that we will explore here is: accessmode.
Accessmode
The accessmode hint is very important for HBase because it avoids MapReduce overhead.
Combined with point queries, it ensures sub-second response times without being affected
by the total data size.
There are multiple ways to specify the accessmode hint: as a query hint or at the session
level. Note that session-level hints take precedence. If "set force local off;" is run in
a session, all subsequent queries will always use MapReduce even if an explicit
accessmode='local' hint is specified on the query.
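The precedence rule just described can be sketched as follows (an interpretation of the behaviour stated above, not Big SQL source; the function name and return values are invented for the example):

```python
def effective_access_mode(session_force_local, query_hint):
    """Resolve the access mode as described above: a session-level
    'set force local on/off' overrides any per-query accessmode hint;
    otherwise the query hint applies, falling back to automatic
    selection (illustrative model only)."""
    if session_force_local == "on":
        return "local"
    if session_force_local == "off":
        return "mapreduce"       # even an explicit 'local' hint is ignored
    return query_hint or "auto"  # session is 'auto': the hint decides

print(effective_access_mode("off", "local"))  # mapreduce: session wins
print(effective_access_mode("auto", "local"))  # local
```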
You can check the state of accessmode, if it was explicitly set, on the session with the
following command in the Big SQL shell.
BigSQL Shell
set;
If you kept the same shell open throughout this part of the lab, you will see the following
output. This is because we used “set force local on” earlier in one of the previous
sections.
+--------------------+-------+
| key                | value |
+--------------------+-------+
| bigsql.force.local | true  |
+--------------------+-------+
1 row in results(first row: 0.0s; total: 0.0s)
To change the setting back to the default, you can change the value to automatic with the
following command.
BigSQL Shell
set force local auto;
Issue the following select query.
BigSQL Shell
select o_orderkey from orders where o_custkey=4 and o_orderkey=5612065;
Notice how long the query takes.
+------------+
| o_orderkey |
+------------+
|    5612065 |
+------------+
1 row in results(first row: 7.2s; total: 7.2s)
Issue the same query with an accessmode hint this time.
BigSQL Shell
select o_orderkey from orders /*+ accessmode='local' +*/ where
o_custkey=4 and o_orderkey=5612065;
Notice how the query responds much faster with the results. This is because of the local
access mode; no MapReduce job is employed.
+------------+
| o_orderkey |
+------------+
|    5612065 |
+------------+
1 row in results(first row: 0.32s; total: 0.32s)
PART II – B – Connecting to Big SQL Server via JDBC
Organizations interested in Big SQL often have considerable SQL skills in-house, as well as
a suite of SQL-based business intelligence applications and query/reporting tools. The idea
of being able to leverage existing skills and tools — and perhaps reuse portions of existing
applications — can be quite appealing to organizations new to Hadoop.
Therefore Big SQL supports a JDBC driver that conforms to the JDBC 3.0 specification to
provide connectivity to Java™ applications. (Big SQL also supports a 32-bit or a 64-bit
ODBC driver, on either Linux or Windows, that conforms to the Microsoft Open Database
Connectivity 3.0.0 specification, to provide connectivity to C and C++ applications).
In this part of the lab, we will explore how to use Big SQL’s JDBC driver with BIRT, an open
source business intelligence and report tool that can plug into Eclipse. We will use this tool to
run some very simple reports using SQL queries on data stored in HBase on our Hadoop
environment.
Business Intelligence and Reporting via BIRT
To start, open eclipse from the Desktop of the virtual machine by clicking on the Eclipse icon.
When prompted to do so, leave the default workspace as is.
Once Eclipse has loaded switch to the 'Report Design' perspective so that we can work with
BIRT. To do so, from the menu bar click on: Window -> Open Perspective -> Other....
Then click on: Report Design -> OK as shown below.
Once in the Report Design perspective, double-click on Orders.rptdesign from the
Navigator pane (on the bottom left-hand side) to open the pre-created report.
Note: A report has been created on your behalf to more quickly illustrate the functionality/usage of the Big
SQL drivers, while removing the tedious steps of designing a report in BIRT.
Expand 'Data Sets' from Data Explorer. You will notice the data sets (or report queries)
contain a red 'X' beside them. This is because the pre-created report queries are not yet
associated with a data source. Now all that is necessary, prior to being able to run the report, is
to set up the JDBC connection to BigSQL.
To obtain the client drivers, open the BigInsights web console from the Desktop of the VM, or
point your browser to: http://bivm:8080. From the Welcome tab, in the Quick Links section,
select Download the Big SQL Client drivers.
Save the file to /home/biadmin/Desktop/IBD-1687A/.
Open the folder where you saved the file and extract the contents of the client package
under the same directory.
Back in Eclipse, add Big SQL as a source. Right-click on Data Sources -> New Data
Source from the Data Explorer pane on the top left-hand side. In the New Data Source
window, select JDBC Data Source and specify “Big SQL” for the Data Source Name. Click
Next.
In the New JDBC Data Source Profile window, click on Manage Drivers…. Once the
Manage JDBC Drivers window appears click on Add…
Point to the location where the client drivers were extracted, then click OK.
Once added, you should have an entry for the BigSQLDriver in the Driver Class
dropdown field list. Select it, and complete the fields with the following information:
• Database URL: jdbc:bigsql://localhost:7052
• User Name: biadmin
• Password: biadmin
Click on ‘Test Connection...’ to ensure we can connect to Big SQL using the JDBC driver.
Double-click 'Orders per year' and add the Big SQL connection that was just defined.
Examine the query:
WITH test
(order_year, order_date)
AS
(SELECT YEAR(o_orderdate), o_orderdate FROM orders FETCH FIRST 20 ROWS
ONLY)
SELECT order_year, COUNT(*) AS cnt FROM test GROUP BY order_year
Carry out the same procedure to add the Big SQL connection for the 'Top 5 salesmen'
data set and examine the query.
WITH base (o_clerk, tot) AS
(SELECT o_clerk, SUM(o_totalprice) AS tot FROM orders GROUP BY o_clerk
ORDER BY tot DESC)
SELECT o_clerk, tot FROM base FETCH FIRST 5 ROWS ONLY
Note: Disregard the red ‘X’ that may still exist on the Data Sets. This is a bug and can safely be ignored.
Now that we have defined the Data Source and have the Data Sets configured, run the
report in Web Viewer as shown in the diagram below.
The output from the web viewer against the orders table on Big SQL should be as follows.
As seen in this part of the lab, a variety of IBM and non-IBM software that supports JDBC
and ODBC data sources can also be configured to work with Big SQL. We used BIRT here,
but as another example, Cognos Business Intelligence can use Big SQL's JDBC interface
to query data, generate reports, and perform other analytical functions. Similarly, other tools
like Tableau can leverage Big SQL’s ODBC drivers to work with data stored in a BigInsights
cluster.
Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social
networks, and more
o Find the community that interests you …
• Information Management bit.ly/InfoMgmtCommunity
• Business Analytics bit.ly/AnalyticsCommunity
• Enterprise Content Management bit.ly/ECMCommunity
• IBM Champions
o Recognizing individuals who have made the most outstanding
contributions to Information Management, Business Analytics, and
Enterprise Content Management communities
• ibm.com/champion
Thank You!
Your Feedback is Important!
• Access the Conference Agenda Builder to complete your session
surveys
o Any web or mobile browser at
http://iod13surveys.com/surveys.html
o Any Agenda Builder kiosk onsite