Hands on Lab
Adding Value to HBase with IBM InfoSphere
BigInsights and BigSQL
Session Number 1687
Piotr Pruski, IBM, piotr.pruski@ca.ibm.com (@ppruski)
Benjamin Leonhardi, IBM
Table of Contents
Lab Setup ............................................................................................................................ 3
Getting Started .................................................................................................................... 3
Administering the Big SQL and HBase Servers................................................................. 4
Part I – Creating Big SQL Tables and Loading Data ......................................................... 6
Background ..................................................................................................................... 6
One-to-one Mapping....................................................................................................... 9
Adding New JDBC Drivers ...................................................................................... 11
One-to-one Mapping with UNIQUE Clause................................................................. 13
Many-to-one Mapping (Composite Keys and Dense Columns)................................... 16
Why do we need many-to-one mapping? ................................................................. 17
Data Collation Problem............................................................................................. 19
Many-to-one Mapping with Binary Encoding.............................................................. 20
Many-to-one Mapping with HBase Pre-created Regions and External Tables ............ 22
Load Data: Error Handling ........................................................................................... 26
[OPTIONAL] HBase Access via JAQL ....................................................................... 27
PART II – A – Query Handling........................................................................................ 31
The Data........................................................................................................................ 31
Projection Pushdown .................................................................................................... 33
Predicate Pushdown ...................................................................................................... 34
Point Scan ................................................................................................................. 34
Partial Row Scan....................................................................................................... 35
Range Scan................................................................................................................ 35
Full Table Scan ......................................................................................................... 36
Automatic Index Usage................................................................................................. 37
Pushing Down Filters into HBase................................................................................. 38
Table Access Hints ....................................................................................................... 39
Accessmode .............................................................................................................. 39
PART II – B – Connecting to Big SQL Server via JDBC ................................................ 40
Business Intelligence and Reporting via BIRT............................................................. 41
Communities ..................................................................................................................... 48
Thank You! ....................................................................................................................... 48
Acknowledgements and Disclaimers................................................................................ 49

Lab Setup
This lab exercise uses the IBM InfoSphere BigInsights Quick Start Edition, v2.1. The Quick
Start Edition uses a non-warranted program license, and is not for production use.
The purpose of the Quick Start Edition is for experimenting with the features of InfoSphere
BigInsights, while being able to use real data and run real applications. The Quick Start
Edition puts no data limit on the cluster and there is no time limit on the license.
The following table outlines the users and passwords that are pre-configured on the image:
username
root
biadmin
db2inst1

password
password
biadmin
password

Getting Started
To prepare for the contents of this lab, you must go through the following process to start all
of the Hadoop components.
1. Start the VMware image by clicking the “Power on this virtual machine” button in
VMware Workstation if the VM is not already on.
2. Log into the VMware virtual machine using the following information
user: biadmin
password: biadmin
3. Double-click on the BigInsights Shell folder icon from the desktop of the Quick Start
VM. This view provides you with quick links to access the following functions that will be
used throughout the course of this exercise:
Big SQL Shell
HBase Shell
Jaql Shell
Linux gnome-terminal

4. Open the Terminal (gnome-terminal) and start the Hadoop components (daemons).
Linux Terminal

start-all.sh

Note: This command may take a few minutes to finish.

Once all components have started successfully as shown below you may move to the next
section.
…
[INFO] Progress - 100%
[INFO] DeployManager - Start; SUCCEEDED components: [zookeeper, hadoop, derby, hive,
hbase, bigsql, oozie, orchestrator, console, httpfs]; Consumes : 174625ms

Administering the Big SQL and HBase Servers
BigInsights provides both command-line tools and a user interface to manage the Big SQL
and HBase servers. In this section, we will briefly go over the user interface which is part of
BigInsights Web Console.
1. Bring up the BigInsights web console by double clicking on the BigInsights
WebConsole icon on the desktop of the VM and open the Cluster Status tab. Select
HBase to view the status of HBase master and region servers.

2. Similarly, click on Big SQL from the same tab to view its status.

3. Use hbase-master and hbase-regionserver web interfaces to visualize tables, regions
and other metrics. Go to the BigInsights Welcome tab and select “Access Secure
Cluster Servers.” You may need to enable pop-ups from the site when prompted.

Alternatively, point your browser to the bottom two URLs noted in the image below.

Some interesting information available from the web interfaces includes:
HBase root directory
• This can be used to find the size of an HBase table.
List of tables with descriptions.
Each table displays a list of regions with start and end keys.
• This information can be used to compact or split tables as needed.
Metrics for each region server.
• These can be used to determine whether there are hot regions which serve the
majority of requests to a table. Such regions can be split. The metrics also help
determine the effects and effectiveness of the block cache, bloom filters and
memory settings.
4. Perform a health check of HBase and Big SQL. This differs from the status checks
done above in that it verifies the health of the functionality itself. From the Linux
gnome-terminal, issue the following commands.
Linux Terminal

$BIGINSIGHTS_HOME/bin/healthcheck.sh hbase

[INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ]
[INFO] Progress - Health check hbase
[INFO] Deployer - Try to start hbase if hbase service is stopped...
[INFO] Deployer - Double check whether hbase is started successfully...
[INFO] @bivm - hbase-master(active) started, pid 6627
[INFO] @bivm - hbase-regionserver started, pid 6745
[INFO] Deployer - hbase service started
[INFO] Deployer - hbase service is healthy
[INFO] Progress - 100%
[INFO] DeployManager - Health check; SUCCEEDED components: [hbase]; Consumes :
26335ms
Linux Terminal

$BIGINSIGHTS_HOME/bin/healthcheck.sh bigsql

[INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ]
[INFO] Progress - Health check bigsql
[INFO] @bivm - bigsql-server already running, pid 6949
[INFO] Deployer - Ping Check Success: bivm/192.168.230.137:7052
[INFO] @bivm - bigsql is healthy
[INFO] Progress - 100%
[INFO] DeployManager - Health check; SUCCEEDED components: [bigsql]; Consumes : 1121ms

Part I – Creating Big SQL Tables and Loading Data
In this part of the lab, our main goal is to demonstrate the migration of a table from a relational
database to BigInsights using Big SQL over HBase. We will see how HBase handles row keys
and some pitfalls that users may encounter when moving data from a relational database to
HBase tables. We will also try useful options such as pre-creating regions to see how they can
help with data loading and queries, and we will explore various ways to load data.

Background

In this lab, we will use one table from the Great Outdoors Sales Data Warehouse model
(GOSALESDW), SLS_SALES_FACT.
The details of the table, along with its primary key information, are depicted in the figure
below.
SLS_SALES_FACT
  Primary key (PK) columns: ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
  RETAILER_KEY, RETAILER_SITE_KEY, PROMOTION_KEY, ORDER_METHOD_KEY
  Other columns: SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY, QUANTITY, UNIT_COST,
  UNIT_PRICE, UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT
There is an instance of DB2 contained on this image which contains this table with data
already loaded that we will use in our migration.
From the Linux gnome-terminal, switch to the DB2 instance user as shown below.
Linux Terminal

su - db2inst1

Note: The password for the db2inst1 is password. Enter this when prompted.

As db2inst1, connect to the pre-created database, gosales.
Linux Terminal

db2 CONNECT TO gosales

Upon successful connection, you should see the following output on the terminal.
Database Connection Information

Database server       = DB2/LINUXX8664 10.5.0
SQL authorization ID  = DB2INST1
Local database alias  = GOSALES
Issue the following command to list all of the tables contained in this database.

Linux Terminal

db2 LIST TABLES
Note: Here you will see three tables. Each one is essentially the same except for one key difference –
the amount of data contained within them. The remaining instructions in this lab exercise will use the
SLS_SALES_FACT_10P table simply because it has a smaller amount of data and will be faster to work
with for demonstration purposes. If you would like to use the larger tables with more data, feel free to do
so, but remember to change the names appropriately.

Table/View                      Schema          Type  Creation time
------------------------------- --------------- ----- --------------------------
SLS_SALES_FACT                  DB2INST1        T     2013-08-22-14.51.27.228148
SLS_SALES_FACT_10P              DB2INST1        T     2013-08-22-14.54.01.622569
SLS_SALES_FACT_25P              DB2INST1        T     2013-08-22-14.55.46.416787

3 record(s) selected.

Examine how many rows this table has so that we can later verify that everything was migrated
properly. Issue the following select statement.
Linux Terminal

db2 "SELECT COUNT(*) FROM sls_sales_fact_10p"

You should expect 44603 rows in this table.
1
-----------
      44603

  1 record(s) selected.

Use the following describe command to view all of the columns and data types that are
contained within this table.
Linux Terminal

db2 "DESCRIBE TABLE sls_sales_fact_10p"

Column name                     Data type schema  Data type name  Length  Scale  Nulls
------------------------------  ----------------  --------------  ------  -----  -----
ORDER_DAY_KEY                   SYSIBM            INTEGER              4      0  Yes
ORGANIZATION_KEY                SYSIBM            INTEGER              4      0  Yes
EMPLOYEE_KEY                    SYSIBM            INTEGER              4      0  Yes
RETAILER_KEY                    SYSIBM            INTEGER              4      0  Yes
RETAILER_SITE_KEY               SYSIBM            INTEGER              4      0  Yes
PRODUCT_KEY                     SYSIBM            INTEGER              4      0  Yes
PROMOTION_KEY                   SYSIBM            INTEGER              4      0  Yes
ORDER_METHOD_KEY                SYSIBM            INTEGER              4      0  Yes
SALES_ORDER_KEY                 SYSIBM            INTEGER              4      0  Yes
SHIP_DAY_KEY                    SYSIBM            INTEGER              4      0  Yes
CLOSE_DAY_KEY                   SYSIBM            INTEGER              4      0  Yes
QUANTITY                        SYSIBM            INTEGER              4      0  Yes
UNIT_COST                       SYSIBM            DECIMAL             19      2  Yes
UNIT_PRICE                      SYSIBM            DECIMAL             19      2  Yes
UNIT_SALE_PRICE                 SYSIBM            DECIMAL             19      2  Yes
GROSS_MARGIN                    SYSIBM            DOUBLE               8      0  Yes
SALE_TOTAL                      SYSIBM            DECIMAL             19      2  Yes
GROSS_PROFIT                    SYSIBM            DECIMAL             19      2  Yes

18 record(s) selected.

One-to-one Mapping
In this section, we will use Big SQL to do a one-to-one mapping of the columns in the
relational DB2 table to an HBase table row key and columns. This is not a recommended
approach; however, the goal of this exercise is to demonstrate the inefficiencies and pitfalls
that can occur with such a mapping.
Big SQL supports both one-to-one and many-to-one mappings.
In a one-to-one mapping, the HBase row key and each HBase column are mapped to a single
SQL column. In the following example, the HBase row key is mapped to the SQL column id.
Similarly, the cq_name column within the cf_data column family is mapped to the SQL
column name, and so on.
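The mapping just described is not reproduced as a figure here, so the following is a minimal
sketch of what such a one-to-one DDL could look like. The table name users and the column
lengths are hypothetical; only the id, name and cf_data:cq_name names come from the example
above, and the lab's real DDL follows further below.
BigSQL Shell

-- Minimal one-to-one mapping sketch (hypothetical table): the row key and each
-- HBase column map to exactly one SQL column.
CREATE HBASE TABLE users
(
  id   int,
  name varchar(40)
)
COLUMN MAPPING
(
  key             mapped by (id),
  cf_data:cq_name mapped by (name)
);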

To begin, first create a schema to keep our tables organized. Open the BigSQL Shell from
the BigInsights Shell folder on desktop and use the create schema command to create a
schema named gosalesdw.
BigSQL Shell

CREATE SCHEMA gosalesdw;

Issue the following command in the same BigSQL shell that is open. This DDL statement will
create the SQL table with the one-to-one mapping of what we have in our relational DB2
source. Notice all the column names are the same with the same data types. The column
mapping section requires a mapping for the row key. HBase columns are identified using
family:qualifier.
BigSQL Shell

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT
(
  ORDER_DAY_KEY     int,
  ORGANIZATION_KEY  int,
  EMPLOYEE_KEY      int,
  RETAILER_KEY      int,
  RETAILER_SITE_KEY int,
  PRODUCT_KEY       int,
  PROMOTION_KEY     int,
  ORDER_METHOD_KEY  int,
  SALES_ORDER_KEY   int,
  SHIP_DAY_KEY      int,
  CLOSE_DAY_KEY     int,
  QUANTITY          int,
  UNIT_COST         decimal(19,2),
  UNIT_PRICE        decimal(19,2),
  UNIT_SALE_PRICE   decimal(19,2),
  GROSS_MARGIN      double,
  SALE_TOTAL        decimal(19,2),
  GROSS_PROFIT      decimal(19,2)
)
COLUMN MAPPING
(
  key                          mapped by (ORDER_DAY_KEY),
  cf_data:cq_ORGANIZATION_KEY  mapped by (ORGANIZATION_KEY),
  cf_data:cq_EMPLOYEE_KEY      mapped by (EMPLOYEE_KEY),
  cf_data:cq_RETAILER_KEY      mapped by (RETAILER_KEY),
  cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
  cf_data:cq_PRODUCT_KEY       mapped by (PRODUCT_KEY),
  cf_data:cq_PROMOTION_KEY     mapped by (PROMOTION_KEY),
  cf_data:cq_ORDER_METHOD_KEY  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_SALES_ORDER_KEY   mapped by (SALES_ORDER_KEY),
  cf_data:cq_SHIP_DAY_KEY      mapped by (SHIP_DAY_KEY),
  cf_data:cq_CLOSE_DAY_KEY     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY          mapped by (QUANTITY),
  cf_data:cq_UNIT_COST         mapped by (UNIT_COST),
  cf_data:cq_UNIT_PRICE        mapped by (UNIT_PRICE),
  cf_data:cq_UNIT_SALE_PRICE   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_GROSS_MARGIN      mapped by (GROSS_MARGIN),
  cf_data:cq_SALE_TOTAL        mapped by (SALE_TOTAL),
  cf_data:cq_GROSS_PROFIT      mapped by (GROSS_PROFIT)
);

Big SQL supports a load from source command that can be used to load data from
warehouse sources which we’ll use first. It also supports loading data from delimited files
using a load hbase command which we will use later.

Adding New JDBC Drivers
The load from source command uses Sqoop internally to do the load. Therefore, before
using the load command from a BigSQL shell, we first need to add the driver for the JDBC
source into 1) the Sqoop library directory, and 2) the JSQSH terminal shared directory.
From a Linux gnome-terminal, issue the following command (as biadmin) to add the JDBC
driver JAR file used to access the database to the $SQOOP_HOME/lib directory.
Linux Terminal

cp /opt/ibm/db2/V10.5/java/db2jcc.jar $SQOOP_HOME/lib

From the BigSQL shell, examine the drivers currently loaded for the JSQSH terminal.
BigSQL Shell

drivers

Terminate the BigSQL shell with the quit command.
BigSQL Shell

quit

Copy the same DB2 driver to the JSQSH share directory with the following command.
Linux Terminal

cp /opt/ibm/db2/V10.5/java/db2jcc.jar
$BIGINSIGHTS_HOME/bigsql/jsqsh/share/

When a user adds new drivers, the Big SQL server must be restarted. You can do this
either from the web console, or with the following command from the Linux gnome-terminal.
Linux Terminal

stop.sh bigsql && start.sh bigsql

Open the BigSQL Shell from the BigInsights Shell folder on the desktop again (it was closed
earlier with the quit command) and check that the driver was in fact loaded into JSQSH.
BigSQL Shell

drivers

Now that the drivers have been set, the load can finally take place. The load from
source statement extracts data from a source outside of an InfoSphere BigInsights cluster
(DB2 in this case) and loads that data into an InfoSphere BigInsights HBase (or Hive) table.
Issue the following command to load the SLS_SALES_FACT_10P table from DB2 into the
SLS_SALES_FACT table we have defined in BigSQL.
BigSQL Shell

LOAD USING JDBC CONNECTION URL 'jdbc:db2://localhost:50000/GOSALES'
WITH PARAMETERS (user = 'db2inst1',password = 'password') FROM TABLE
SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY INTO HBASE TABLE
gosalesdw.sls_sales_fact APPEND;

You should expect to load 44603 rows which is the same number of rows that the select
count statement on the original DB2 table verified earlier.
44603 rows affected (total: 1m37.74s)

Try to verify this with a select count statement as shown.
BigSQL Shell

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact;

Notice there is a discrepancy between the results from the load operation and the select
count statement.
+----+
|    |
+----+
| 33 |
+----+
1 row in results(first row: 3.13s; total: 3.13s)

Also verify from an HBase shell. Open the HBase Shell from the BigInsights Shell folder on
desktop and issue the following count command to verify the number of rows.
HBase Shell

count 'gosalesdw.sls_sales_fact'

It should be apparent that the results from the Big SQL statement and HBase commands
conform to one another.
33 row(s) in 0.7000 seconds

However, this doesn't yet explain why there is a mismatch between the number of loaded
rows and the number of rows retrieved when we query the table.
The load (and insert, examined later) command behaves like an upsert: if a row with the
same row key already exists, HBase writes the new value as a new version of that
column/cell. When querying the table, only the latest value is returned by Big SQL.
In many cases this behaviour can be confusing. In our case, we loaded data with repeating
row-key values from a DB2 table with 44603 rows, and the load reported 44603 rows
affected. However, the select count(*) showed far fewer rows; 33 to be exact. No errors are
thrown in such scenarios, so it is always recommended to cross-check the number of rows by
querying the table as we did.
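Before migrating, it can also help to check in the source database how unique the candidate
row key really is. The following is a quick, hedged sketch against the DB2 table used in this
lab; a large gap between distinct keys and total rows means heavy key collisions (and hence
cell versioning) once the data lands in HBase.
Linux Terminal

db2 "SELECT COUNT(DISTINCT order_day_key) AS distinct_keys, COUNT(*) AS total_rows FROM sls_sales_fact_10p"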
Now that we understand that all the rows are actually versioned in HBase, we can examine a
possible way to retrieve all versions of a particular row.

First, from the BigSQL shell, issue the following select query with a predicate on the order
day key. In the original table, there are most likely many tuples with the same order day key.
BigSQL Shell

SELECT organization_key FROM gosalesdw.sls_sales_fact WHERE
order_day_key = 20070720;

As expected, we only retrieve one row, which is the latest or newest version of the row
inserted into HBase with the specified order day key.
+------------------+
| organization_key |
+------------------+
|            11171 |
+------------------+

Using the HBase shell, we can retrieve previous versions for a row key. Use the following
get command to get the top 4 versions of the row with row key 20070720.
HBase Shell

get 'gosalesdw.sls_sales_fact', '20070720', {COLUMN =>
'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}

Since the previous command specified only 4 versions (VERSIONS => 4), we only retrieve
4 rows in the output.
COLUMN                        CELL
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546430, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546429, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546428, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546427, value=11171
4 row(s) in 0.0360 seconds

Optionally, try the same command again specifying a larger number of versions, for example
VERSIONS => 100.
Either way, this is most likely not the behaviour users expect when performing such a
migration; they probably wanted all the data in the HBase table without versioned cells.
There are a couple of solutions for this. One is to define the table with a composite row key
to enforce uniqueness, which will be explored later in this lab. Another option, outlined in
the next section, is to force each row key to be unique by appending a UUID.

One-to-one Mapping with UNIQUE Clause

Another option when performing such a migration is to use the force key unique option
when creating the table with Big SQL syntax. This option forces the load to add a UUID to
the row key, which prevents versioning of cells. However, this method is quite inefficient,
as it stores more data and also makes queries slower.
Issue the following command in the BigSQL shell. This statement will create the SQL table
with the one-to-one mapping of what we have in our relational DB2 source. This DDL
statement is almost identical to what was seen in the previous section with one exception:
the force key unique clause is specified for the column mapping of the row key.
BigSQL Shell

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_UNIQUE
(
  ORDER_DAY_KEY     int,
  ORGANIZATION_KEY  int,
  EMPLOYEE_KEY      int,
  RETAILER_KEY      int,
  RETAILER_SITE_KEY int,
  PRODUCT_KEY       int,
  PROMOTION_KEY     int,
  ORDER_METHOD_KEY  int,
  SALES_ORDER_KEY   int,
  SHIP_DAY_KEY      int,
  CLOSE_DAY_KEY     int,
  QUANTITY          int,
  UNIT_COST         decimal(19,2),
  UNIT_PRICE        decimal(19,2),
  UNIT_SALE_PRICE   decimal(19,2),
  GROSS_MARGIN      double,
  SALE_TOTAL        decimal(19,2),
  GROSS_PROFIT      decimal(19,2)
)
COLUMN MAPPING
(
  key                          mapped by (ORDER_DAY_KEY) force key unique,
  cf_data:cq_ORGANIZATION_KEY  mapped by (ORGANIZATION_KEY),
  cf_data:cq_EMPLOYEE_KEY      mapped by (EMPLOYEE_KEY),
  cf_data:cq_RETAILER_KEY      mapped by (RETAILER_KEY),
  cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
  cf_data:cq_PRODUCT_KEY       mapped by (PRODUCT_KEY),
  cf_data:cq_PROMOTION_KEY     mapped by (PROMOTION_KEY),
  cf_data:cq_ORDER_METHOD_KEY  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_SALES_ORDER_KEY   mapped by (SALES_ORDER_KEY),
  cf_data:cq_SHIP_DAY_KEY      mapped by (SHIP_DAY_KEY),
  cf_data:cq_CLOSE_DAY_KEY     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY          mapped by (QUANTITY),
  cf_data:cq_UNIT_COST         mapped by (UNIT_COST),
  cf_data:cq_UNIT_PRICE        mapped by (UNIT_PRICE),
  cf_data:cq_UNIT_SALE_PRICE   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_GROSS_MARGIN      mapped by (GROSS_MARGIN),
  cf_data:cq_SALE_TOTAL        mapped by (SALE_TOTAL),
  cf_data:cq_GROSS_PROFIT      mapped by (GROSS_PROFIT)
);

In the previous section, we used the load from source command to get the data from our
table in the DB2 source into HBase. This may not always be feasible, which is why in this
section we explore another loading statement, load hbase. This loads data into HBase from
flat files – which might, for example, be an export of the data from the relational source.
Issue the following statement which will load data from a file into an InfoSphere BigInsights
HBase table.
BigSQL Shell

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.sls_sales_fact_unique;

 Note: The load hbase command can take in an optional list of columns. If no column list is specified, it
will use the column ordering in table definition. The input file can be on DFS or on the local file system
where Big SQL server is running.

Once again, you should expect to load 44603 rows which is the same number of rows that
the select count statement on the original DB2 table verified.
44603 rows affected (total: 26.95s)

Verify the number of rows loaded with a select count statement as shown.
BigSQL Shell

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;

This time there is no discrepancy between the results from the load operation and the select
count statement.
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 1.61s; total: 1.61s)

Issue the same count from the HBase shell to be sure.
HBase Shell

count 'gosalesdw.sls_sales_fact_unique'

The counts are consistent across the load, the Big SQL select, and the HBase count.
...
44603 row(s) in 6.8490 seconds

As in the previous section, from the BigSQL shell, issue the following select query with a
predicate on the order day key.
BigSQL Shell

SELECT organization_key FROM gosalesdw.sls_sales_fact_unique WHERE
order_day_key = 20070720;

In the previous section, only one row was returned for the specified date. This time, expect to
see 1405 rows since the rows are now forced to be unique due to our clause in the create
statement and therefore no versioning should be applied.
1405 rows in results(first row: 0.47s; total: 0.58s)

Once again, as in the previous section, we can check from the HBase shell if there are
multiple versions of the cells. Issue the following get statement to attempt to retrieve the top
4 versions of the row with row key 20070720.
HBase Shell

get 'gosalesdw.sls_sales_fact_unique', '20070720', {COLUMN =>
'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}

Zero rows are returned because the row key of 20070720 doesn’t exist. This is due to the
fact we have appended the UUID to each row key; (20070720 + UUID).
COLUMN                        CELL
0 row(s) in 0.0850 seconds

Therefore, instead, issue the following HBase command to do a scan rather than a get. This
scans the table using the first part of the row key. We also specify scanner start and stop
row values so that only the results we are interested in are returned.
HBase Shell

scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '20070720',
STOPROW => '20070721'}

Notice there are no discrepancies between the results from Big SQL select and HBase scan.
1405 row(s) in 12.1350 seconds

Many-to-one Mapping (Composite Keys and Dense Columns)
This section is dedicated to the other option of trying to enforce uniqueness of the cells and
that is to define a table with a composite row key (aka many-to-one mapping).
In a many-to-one mapping, multiple SQL columns are mapped to a single HBase entity (row
key or a column). There are two terms that may be used frequently: composite key and
dense column. A composite key is an HBase row key that is mapped to multiple SQL
columns. A dense column is an HBase column that is mapped to multiple SQL columns.
In the following example, the row key contains two parts – userid and account number. Each
part corresponds to a SQL column. Similarly, the HBase columns are mapped to multiple SQL
columns. Note that we can have a mix; for example, a composite key, a dense column and a
non-dense column, or any combination of these.

[Figure: example many-to-one mapping – the HBase row key "11111_ac11" maps to the SQL
columns userid and acc_no; the dense column cf_data:cq_names ("fname1_lname1") maps to
the first-name and last-name columns; the dense column cf_data:cq_acct ("11111#11#0.25")
maps to the balance, minimum-balance and interest columns.]

Issue the following DDL statement from the BigSQL shell, which represents all entities from
our relational table using a many-to-one mapping. Take notice of the column mapping
section, where multiple columns can be mapped to a single family:qualifier.
BigSQL Shell

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE
(
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int,
  RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int,
  SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
  key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
                 RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY),
  cf_data:cq_OTHER_KEYS    mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY      mapped by (QUANTITY),
  cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                      GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
);

Why do we need many-to-one mapping?
HBase stores a lot of information for each value. For each value stored, a key consisting of
the row key, column family name, column qualifier and timestamp is also stored. This
means a lot of duplicate information is kept.
HBase is very verbose and it is primarily intended for sparse data. In most cases, data in the
relational world is not sparse. If we were to store each SQL column individually in HBase, as
in our previous two sections, the required storage space would grow dramatically. When
querying that data back, the query also returns the entire key (the row key, column family,
and column qualifier) for each value. As an example, after loading data into this table we
will examine the storage space for each of the three tables created thus far.
As in the previous section, issue the following statement which will load data from a file into
the InfoSphere BigInsights HBase table.
BigSQL Shell

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.sls_sales_fact_dense;

Notice that the number of rows loaded into the table with many-to-one mapping remains the
same even though we are storing less data. This statement also executes much faster than the
previous loads for exactly this reason.
44603 rows affected (total: 3.42s)

Issue the same statements and commands from both the BigSQL and HBase shells as in
the previous two sections to verify that the number of rows is the same as in the original
dataset. All of the results should be the same as in the previous section.
BigSQL Shell

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense;

+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 0.93s; total: 0.93s)
BigSQL Shell

SELECT organization_key FROM gosalesdw.sls_sales_fact_dense WHERE
order_day_key = 20070720;
1405 rows in results(first row: 0.65s; total: 0.68s)
HBase Shell

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW
=> '20070721'}
1405 row(s) in 4.3830 seconds

As noted earlier, one-to-one mapping uses much more storage space than the same data
mapped using composite keys or dense columns, where the HBase row key or HBase
column(s) are made up of multiple relational table columns. This is because HBase repeats
the row key, column family name, column name and timestamp for each column value.
For relational data, which is usually dense, this causes an explosion in the required
storage space.
Issue the following command as biadmin from a Linux gnome-terminal to check the directory
sizes for the three tables we created thus far.

Linux Terminal

hadoop fs -du /hbase/

…
17731926   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact
3188       hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_dense
47906322   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_unique
…

Notice that the dense table is significantly smaller than the others. The table in which we
forced uniqueness is the largest since it needs to append a UUID to each row key.

Data Collation Problem
All data represented thus far has been stored as strings, which is the default encoding for
HBase tables created by Big SQL. Therefore, numeric data is not collated correctly: HBase
uses lexicographic ordering, so you may run into cases where a query returns wrong results.
The following scenario walks through a situation where data is not collated correctly.
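Before walking through it, here is a quick, standalone illustration of lexicographic ordering. The
query below is only a sketch (it assumes the DB2 instance from earlier in this lab, though any
SQL engine that sorts these literals as strings behaves similarly): sorting the keys used in this
scenario as strings places 200707201 between 20070720 and 20070721, which matches HBase's
byte-wise ordering of these row keys and not their numeric order.
Linux Terminal

db2 "SELECT k FROM (VALUES ('20070720'), ('200707201'), ('20070721')) AS t(k) ORDER BY k"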
Using the Big SQL insert into hbase statement, add the following row to the
sls_sales_fact_dense table we previously defined and loaded. Notice that the value we are
specifying for the ORDER_DAY_KEY column (which has data type int) is a larger numeric
value that does not conform to any date standard, since it contains an extra digit.
BigSQL Shell

INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY,
ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY,
PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) VALUES (200707201, 11171,
4428, 7109, 5588, 30265, 5501, 605);

 Note: The insert command is available for HBase tables; however, it is not a supported feature.

Issue a scan on the table with the following start and stop criteria.
HBase Shell

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW
=> '20070721'}

Take notice of the last three rows/cells returned from the output of this scan. The newly
added row shows up in the scan even though its integer value is not between 20070720 and
20070721.

 200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value=
 200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value=
 200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value=
1406 row(s) in 4.2400 seconds

Now insert another row into the table with the following command. This time we are
conforming to the date format of YYYYMMDD and incrementing the day by 1 from the last
value returned in the table; i.e., 20070721.
BigSQL Shell

INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY,
ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY,
PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) VALUES (20070721, 11171,
4428, 7109, 5588, 30265, 5501, 605);

Issue another scan on the table. Keep in mind to increase the stoprow criteria by 1 day.
HBase Shell

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW
=> '20070722'}

Now notice that the newly added row is included in the result set, and that the row with
ORDER_DAY_KEY 20070721 sorts after the row with ORDER_DAY_KEY 200707201, even though
20070721 is numerically smaller. This is an example of numeric data not being collated
properly: the rows are stored in byte-lexicographic order rather than in the numerical order
one might expect.
 200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value=
 200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value=
 200707201\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value=
 20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692480966, value=
 20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_OTHER_KEYS, timestamp=1376692480966, value=
 20070721\x0011171\x004428\x007109\x005588\x0030265\x005501\x00605 column=cf_data:cq_QUANTITY, timestamp=1376692480966, value=
1407 row(s) in 2.8840 seconds

Many-to-one Mapping with Binary Encoding

Big SQL supports two types of data encodings: string and binary. Each HBase entity can also
have its own encoding; for example, a row key can be encoded as a string, one HBase column
as binary and another as string.
String is the default encoding used in Big SQL HBase tables. The value is converted to a
string and stored as UTF-8 bytes. When multiple parts are packed into one HBase entity,
separators are used to delimit the data. The default separator is the null byte; as it is the
lowest byte, it maintains data collation and allows range queries and partial row scans to
work correctly.
Binary encoding in Big SQL is sortable, so numeric data, including negative numbers,
collates properly. It handles separators internally and avoids issues with separators existing
within the data by escaping them.
Issue the following DDL statement from the BigSQL shell to create a dense table as in the
previous section, but this time overriding the default encoding to binary.
BigSQL Shell

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE_BINARY
(
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int,
  RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int,
  SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
  key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
                 RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY),
  cf_data:cq_OTHER_KEYS    mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
  cf_data:cq_QUANTITY      mapped by (QUANTITY),
  cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                      GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
)
default encoding binary;

Once again, use the load hbase data command to load the data into the table. This time
we are adding the DISABLE WAL clause. By disabling the WAL (write-ahead log), writes
into HBase can be sped up. However, this is not a safe option: turning off the WAL can
result in data loss if a region server crashes. Another possible option to speed up the load
is to increase the write buffer size.
BigSQL Shell

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.sls_sales_fact_dense_binary DISABLE WAL;

44603 rows affected (total: 5.54s)

Issue a select statement on the newly created and loaded table with binary encoding,
sls_sales_fact_dense_binary.
BigSQL Shell

SELECT * FROM gosalesdw.sls_sales_fact_dense_binary
go -m discard;

 Note: The "go -m discard" option is used so that the results of the command are not displayed in
the terminal.

44603 rows in results(first row: 0.35s; total: 2.89s)

Issue another select statement, this time against the previous table that has string encoding,
sls_sales_fact_dense.
BigSQL Shell

SELECT * FROM gosalesdw.sls_sales_fact_dense
go -m discard;
44605 rows in results(first row: 0.31s; total: 3.1s)

The main point here is that the query against the binary-encoded table can return faster.
(Numeric types are also collated properly.)
 Note: You will probably not see much, if any, performance differences in this lab exercise since we are
working with such a small dataset.

There is no custom serialization/deserialization logic required for string encoding. This
makes it portable in case one wants to use another application to read the data in the HBase
tables. A main use case for string encoding is mapping existing data: delimited data is a very
common storage format and it can be easily mapped using Big SQL string encoding.
However, parsing strings is expensive, so queries over data encoded as strings are slow.
Also, numeric data is not collated correctly, as we have seen.
Queries on data encoded as binary have faster response times, and numeric data, including
negative numbers, is collated correctly. The downside is that the data is encoded by Big
SQL's own logic and may not be portable as-is.
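As a quick sanity check of collation on the binary-encoded table, here is a hedged sketch of a
range query whose predicate is on the leading column of the composite key. With binary
encoding the integer values collate correctly, so only rows whose ORDER_DAY_KEY actually
falls in the range should be counted.
BigSQL Shell

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense_binary
WHERE order_day_key >= 20070720 AND order_day_key < 20070721;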

Many-to-one Mapping with HBase Pre-created Regions and External Tables
HBase automatically splits regions when they reach a set size limit. In some scenarios, such
as bulk loading, it is more efficient to pre-create regions so that the load operation can take
place in parallel. The sales data covers four months, April through July of 2007, so we can
pre-create regions by specifying splits in the create table command.
In this section, we will create a table within the HBase shell with pre-defined splits, not using
any Big SQL features at first. Then we will show how users can map existing data in HBase
to Big SQL, which can prove to be a very common practice. This is made possible by
creating what are called external tables.
Start by issuing the following statement in the HBase shell. This will create the
sls_sales_fact_dense_split table with pre-defined region splits for April through July in 2007.
HBase Shell

create 'gosalesdw.sls_sales_fact_dense_split', {NAME => 'cf_data',
REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION =>
'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS =>
'0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER
=> 'NONE', TTL => '2147483647', VERSIONS => '2147483647', BLOCKSIZE =>
'65536'}, {SPLITS => ['200704', '200705', '200706', '200707']}

Issue the following list command on the HBase shell to verify the newly created table.
HBase Shell

list

Note that if we were to list the tables from the Big SQL shell, we would not see this table
because we have not made any association yet to Big SQL.
Open and point a browser to the following URL: http://bivm:60010/. Scroll down and click on
the table we had just defined in the HBase shell, gosalesdw.sls_sales_fact_dense_split.

Examine the pre-created regions for this table as we had defined when creating the table.

Execute the following create external hbase table command to map the existing table we have
just created in HBase to Big SQL. Some things to note about the command:
• The create table statement allows specifying a different name for the SQL table through
the hbase table name clause. Using external tables, you can also create multiple views of
the same HBase table; for example, one table can map to a few columns and another table
to another set of columns.
• The column mapping section of the create table statement allows specifying a different
separator for each column and for the row key.
• External tables can also be used to map tables created using the Hive HBase storage
handler, which cannot be directly read using the Big SQL storage handler.
BigSQL Shell

CREATE EXTERNAL HBASE TABLE GOSALESDW.EXTERNAL_SLS_SALES_FACT_DENSE_SPLIT
(
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int,
  RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int,
  SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
  key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
                 RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY)
      SEPARATOR '-',
  cf_data:cq_OTHER_KEYS    mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY)
      SEPARATOR '/',
  cf_data:cq_QUANTITY      mapped by (QUANTITY),
  cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                      GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
      SEPARATOR '|'
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';

The data in external tables is not validated at creation time. For example, if a column in the
external table contains data with separators incorrectly defined, the query results would be
unpredictable.
 Note: External tables are not owned by Big SQL and hence cannot be dropped via Big SQL. Also,
secondary indexes cannot be created via Big SQL on external tables.

Use the following command to load the external table we have defined.
BigSQL Shell

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t' INTO TABLE
gosalesdw.external_sls_sales_fact_dense_split;
44603 rows affected (total: 1m57.2s)

Verify that the same number of rows loaded is also the same number of rows returned by
querying the external SQL table.
BigSQL Shell

SELECT COUNT(*) FROM gosalesdw.external_sls_sales_fact_dense_split;

+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 6.44s; total: 6.46s)

Verify the same from the HBase shell directly on the underlying HBase table.
HBase Shell

count 'gosalesdw.sls_sales_fact_dense_split'

...
44603 row(s) in 9.1620 seconds

Issue a get command from the HBase shell specifying the row key as follows. Notice the
separator between each part of the row key is a “-” which is what we defined when originally
creating the external table.
HBase Shell

get 'gosalesdw.sls_sales_fact_dense_split', '20070720-11171-4428-7109-5588-30263-5501-605'

In the following output you can also see the other separators we defined for the external
table: "|" for cq_DOLLAR_VALUES and "/" for cq_OTHER_KEYS.
COLUMN                     CELL
 cf_data:cq_DOLLAR_VALUES  timestamp=1376690502630, value=33.59|62.65|62.65|0.4638|1566.25|726.50
 cf_data:cq_OTHER_KEYS     timestamp=1376690502630, value=481896/20070723/20070723
 cf_data:cq_QUANTITY       timestamp=1376690502630, value=25
3 row(s) in 0.0610 seconds

Of course in Big SQL we don't need to specify the separators such as “-” when querying
against the table as with the command below.
BigSQL Shell

SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split WHERE
ORDER_DAY_KEY = 20070720 AND ORGANIZATION_KEY = 11171 AND EMPLOYEE_KEY
= 4428 AND RETAILER_KEY = 7109 AND RETAILER_SITE_KEY = 5588 AND
PRODUCT_KEY = 30263 AND PROMOTION_KEY = 5501 AND ORDER_METHOD_KEY =
605;

Load Data: Error Handling
In this final section of this part of the lab, we will examine how to handle errors during the
load operation.
The load hbase command has an option to continue past errors. The LOG ERROR ROWS
IN FILE clause can be used to specify a file name in which to log any rows that could not be
loaded because of errors. Some common errors are invalid numeric types and a separator
occurring within the data when string encoding is used.
Linux Terminal

hadoop fs -cat /user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt
2007072a    11171   …
b0070720    11171   …
2007-07-20  11171   …
20070720    11-71   …
20070721    11171   …
Note that a separator appearing within the data is an issue with string encoding.
Knowing there are errors with the input data, proceed to issue the following load command,
specifying a directory and file where to put the “bad” rows.
BigSQL Shell

LOAD HBASE DATA INPATH
'/user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt' DELIMITED FIELDS
TERMINATED BY 't' INTO TABLE
gosalesdw.external_sls_sales_fact_dense_split LOG ERROR ROWS IN FILE
'/tmp/SLS_SALES_FACT_load.err';

In this example, 4 rows did not get loaded because of errors. Note that the load reports only
the rows that made it through successfully.
1 row affected (total: 2.74s)

Examine the file specified in the load command to view the rows which were not loaded.
Linux Terminal

hadoop fs -cat /tmp/SLS_SALES_FACT_load.err
"2007072a","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"b0070720","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"2007-07-20","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"
"20070720","11-71","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…"

[OPTIONAL] HBase Access via JAQL
Jaql has an HBase module that can be used to create HBase tables, insert data into them and
query them efficiently using multiple modes: a local mode that accesses HBase directly, as
well as a map reduce mode. It allows specifying query optimization options similar to what is
available in the hbase shell. The capability to transparently use map reduce jobs makes it
work well with bigger tables; at the same time, users can force local mode when they run
point or range queries. It allows use of a SQL language subset termed Jaql SQL, which
provides the capability to join, group and perform other aggregations on tables. It also
provides access to data from different sources, such as relational DBMSs, and different
formats, such as delimited files, Avro and anything else supported by Jaql. The results of a
query can be written in different formats to HDFS and read by other BigInsights applications,
like BigSheets, for further analysis. In this section, we will first pull information from our
relational DBMS and then go over the Jaql HBase module, specifically the additional features
that it provides.
Start by opening a Jaql shell. You can open the same (JSQSH) terminal that was used for
Big SQL by adding the "--jaql" option as shown below. This is a much better environment to
work in than the standard Jaql shell, as it provides command history via the up arrow key and
lets you move through a command with the left/right arrow keys.
Linux Terminal

/opt/ibm/biginsights/bigsql/jsqsh/bin/jsqsh --jaql;

Once in the JSQSH shell with Jaql option, load the dbms::jdbc driver with the following
command.
BigSQL/JAQL Shell

import dbms::jdbc;

Add the JDBC driver JAR file to the classpath.
BigSQL/JAQL Shell

addRelativeClassPath(getSystemSearchPath(),
'/opt/ibm/db2/V10.5/java/db2jcc.jar');

Supply the connection information.
BigSQL/JAQL Shell

db := jdbc::connect(
driver = 'com.ibm.db2.jcc.DB2Driver',
url = 'jdbc:db2://localhost:50000/gosales',
properties = {user: "db2inst1", password: "password"} );

Specify the rows to be retrieved with a SQL select statement.
BigSQL/JAQL Shell

DESC := jdbc::prepare( db, query =
"SELECT * FROM db2inst1.sls_sales_fact_10p");

In the many-to-one mapping for the row key, we went over the creation of a composite key.
In the next few steps, we will use Jaql to load the same data using a composite key and dense
columns. We will pack all the columns that make up the primary key of the relational table
into the HBase row key, and pack the remaining columns into dense HBase columns.
Define a variable to read the original data from the relational JDBC source. This converts
each tuple of the table into a JSON record.
BigSQL/JAQL Shell

ssf = localRead(DESC);

Transform the records into the required format. Essentially we are doing the same thing as
when we defined the many-to-one mapping in the previous sections. For the first element,
which we will use as the HBase row key, concatenate the values of the columns that form the
primary key of the sales fact table using a "-" separator. Pack the remaining columns into the
other dense HBase columns: cq_OTHER_KEYS (using the "/" separator), cq_QUANTITY,
and cq_DOLLAR_VALUES (using the "|" separator).
BigSQL/JAQL Shell

ssft = ssf -> transform [$."ORDER_DAY_KEY", $."ORGANIZATION_KEY",
$."EMPLOYEE_KEY", $."RETAILER_KEY", $."RETAILER_SITE_KEY",
$."PRODUCT_KEY", $."PROMOTION_KEY", $."ORDER_METHOD_KEY",
$."SALES_ORDER_KEY", $."SHIP_DAY_KEY", $."CLOSE_DAY_KEY", $."QUANTITY",
$."UNIT_COST", $."UNIT_PRICE", $."UNIT_SALE_PRICE", $."GROSS_MARGIN",
$."SALE_TOTAL", $."GROSS_PROFIT"] -> transform
{
  key: strcat($[0],"-",$[1],"-",$[2],"-",$[3],"-",$[4],"-",$[5],"-",$[6],"-",$[7]),
  cf_data: {
    cq_OTHER_KEYS: strcat($[8],"/",$[9],"/",$[10]),
    cq_QUANTITY: strcat($[11]),
    cq_DOLLAR_VALUES:
      strcat($[12],"|",$[13],"|",$[14],"|",$[15],"|",$[16],"|",$[17])
  }
};

Verify the data is in the correct format by querying the first record.
BigSQL/JAQL Shell

ssft -> top 1;

{
"key": "20070418-11114-4415-7314-5794-30124-5501-605",
"cf_data": {
"cq_OTHER_KEYS": "254121/20070423/20070423",
"cq_QUANTITY": "60",
"cq_DOLLAR_VALUES": "610.00m|1359.72m|1291.73m|0.5278|77503.80m|40903.80m"
}
}
(1 row in 2.40s)

Now we have the data ready to be written into HBase. First import the hbase module which
prepares jaql by loading required jars and preparing the environment using the HBase
configuration files.
BigSQL/JAQL Shell

import hbase(*);

Use hbaseString to define a schema for the HBase table. The HBase table does not get
created until something is written into it. An array of records that match the specified schema
should be used to write into the HBase table. The data types correspond to how Jaql will
interpret the data.
BigSQL/JAQL Shell

SSFHT = hbaseString('sales_fact2', schema { key: string, cf_data?: {*:
string}}, create=true, replace=true, rowBatchSize=10000,
colBatchSize=200 );

Note: As this could be a big table, specify rowBatchSize and colBatchSize, which are used for
scanner caching and the column batch size by the internal HBase scan object. The column batch size is
useful when rows have a huge number of columns.

Write to the table using the previously created ssft array which matches the specified
schema.
BigSQL/JAQL Shell

ssft -> write(SSFHT);

A write operation will create the HBase table, and populate it with the input data. To confirm,
use hbase shell to count (or scan) the table and verify the data was written with the right
number of rows.
HBase Shell

count 'sales_fact2'
44603 row(s) in 3.6230 seconds

To read the contents of the HBase table using Jaql, use read on the hbaseString. In the
following command we are also passing the read directly into a count function to verify the
right number of rows.
BigSQL/JAQL Shell

count(read(SSFHT));
44603

To query for rows matching a particular order day key 20070720, use setKeyRange for the
partial range query. Use localRead for point and range queries as Jaql is tuned for local
execution and performs efficiently.
BigSQL/JAQL Shell

localRead(SSFHT -> setKeyRange('20070720', '20070721'));

Perform the same query using the HBase shell. Both complete in a similar amount of time.
HBase Shell

scan 'sales_fact2', {STARTROW => '20070720', STOPROW => '20070721',
CACHE => 10000}

To query for a row when we have the values for all primary key columns, we can construct
the entire row key and perform a point query.
BigSQL/JAQL Shell

localRead(SSFHT -> setKey('20070720-11171-4428-7109-5588-30263-5501-605'));

For comparison, this is what the same lookup looks like from the HBase shell.
HBase Shell

get 'sales_fact2', '20070720-11171-4428-7109-5588-30263-5501-605'

To use a filter from Jaql, use the setFilter function along with addFilter. In the case
below, the predicate is on the sales order key, which is the leading part of the dense column
cq_OTHER_KEYS and hence can be used in the predicate.
BigSQL/JAQL Shell

read(SSFHT -> setFilter([addFilter(filterType.SingleColumnValueFilter,
HBaseKeyArrayToBinary(["481896/"]),
compareOp.equal,
comparators.BinaryPrefixComparator,
"cf_data",
"cq_OTHER_KEYS",
true
)
])
);
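For comparison, roughly the same predicate can be expressed in Big SQL against the external
table created earlier in this lab. This is only a hedged sketch: whether Big SQL pushes this
predicate down into HBase as a filter depends on the access plan and any hints, which is the
topic of Part II.
BigSQL Shell

SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split WHERE sales_order_key = 481896;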

PART II – A – Query Handling
Efficiently querying HBase requires pushing as much work to the server(s) as possible. This
includes projection pushdown, i.e. fetching only the minimal set of columns required by the
query. It also includes pushing query predicates down into the server as scan limits, filters,
index lookups, and so on. Setting scan limits is extremely powerful, as it helps narrow down
the regions we need to scan. With a full row key, HBase can quickly pinpoint the region and
the row. With partial keys and key ranges (upper limits, lower limits or both), HBase can
narrow down regions or eliminate regions that fall outside the range.
Indexes help to leverage this key lookup, but they use two tables to achieve it. Filters cannot
eliminate regions, but some have the capability to skip within a region; they help narrow
down the data set returned to the client.
With limited metadata and statistics about HBase tables, supporting a variety of hints helps
improve query efficiency.

The Data
This section describes the schema that the sample data will use to demonstrate the effects
of pushdown from Big SQL.
We will use a TPC-H table: the ORDERS table, with 150,000 rows, defined using the mapping
shown below.
Issue the following command from a Big SQL shell to create the orders table. Notice that this
table has a many-to-one mapping, meaning there is a composite key and dense columns.
BigSQL Shell

CREATE HBASE TABLE ORDERS
(
  O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1),
  O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15),
  O_CLERK VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79)
)
column mapping
(
  key   mapped by (O_CUSTKEY, O_ORDERKEY),
  cf:d  mapped by (O_ORDERSTATUS, O_TOTALPRICE, O_ORDERPRIORITY, O_CLERK,
                   O_SHIPPRIORITY, O_COMMENT),
  cf:od mapped by (O_ORDERDATE)
)
default encoding binary;

Load the sample data into the newly created table by issuing the following command.
Note: As in Part I, there are three sample data sets provided for you. Each one is essentially the same
except for the amount of data contained within it. The remaining instructions in this lab exercise will use the
orders.10p.tbl dataset simply because it has a smaller amount of data and will be faster to work with for
demonstration purposes. If you would like to use the larger datasets with more data, feel free to do so, but
remember to change the names appropriately.
BigSQL Shell

LOAD HBASE DATA INPATH 'tpch/orders.10p.tbl' DELIMITED FIELDS
TERMINATED BY '|' INTO TABLE ORDERS;
150000 rows affected (total: 21.52s)

In the next set of sections, we examine the output from the Big SQL log files to point out
what you can check to confirm pushdown from Big SQL. To view the log messages, you may
first have to change the logging levels using the commands below.
BigSQL Shell

log com.ibm.jaql.modules.hcat.mapred.JaqlHBaseInputFormat info;
BigSQL Shell

log com.ibm.jaql.modules.hcat.hbase info;

Note that projection is pushed down at the HBase column level. So in many-to-one mappings, if a
query requires only one part of a dense column with many parts, the entire value of the dense
column will still be returned. It is therefore efficient to pack together columns that are usually
queried together.
Use the following command to tail the Big SQL log file. Keep it open in a terminal
throughout this entire part of the lab; we will refer to it quite often to see what is
going on behind the scenes when running certain commands.
Linux Terminal

tail -f /var/ibm/biginsights/bigsql/logs/bigsql.log

Projection Pushdown
The first query here does a SELECT * and requests all HBase columns used in the table
mapping. The original HBase table could have many more columns; we may have defined an
external table mapping to just a few of them. In such cases, only the HBase columns used
in the mapping will be retrieved.
BigSQL Shell

SELECT * FROM orders
go -m discard;
150000 rows in results(first row: 1.73s; total: 10.69s)

In the Big SQL log file, we can see that the scan requested data from both mapped HBase columns (cf:d and cf:od).
BigSQL Log

…
…HBase scan details:{…, families={cf=[d, od]}, …, stopRow=, startRow=,
totalColumns=2, …}

The second query requests only one HBase column:
BigSQL Shell

SELECT o_totalprice FROM orders
go -m discard;

Notice that the query returns much faster since we are returning much less data.
150000 rows in results(first row: 0.27s; total: 2.83s)

Verify from the log file that this query only executed against one column.
BigSQL Log

…
…HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, totalColumns=1,
…}

The third query also requests only one HBase column.

BigSQL Shell

SELECT o_orderdate FROM orders
go -m discard;

Although this query returns less data, it actually has a higher response time because
serialization/deserialization of the TIMESTAMP type is expensive.
150000 rows in results(first row: 0.37s; total: 4.5s)
BigSQL Log

…
…HBase scan details:{…, families={cf=[od]}, …, stopRow=, startRow=, totalColumns=1,
…}

Predicate Pushdown
Point Scan
Identifying and using point scans is the most effective optimization for queries into HBase.
To convert a query into a point scan, we need predicate values covering the full row key.
These can come in as multiple predicates, since Big SQL supports composite keys.
The query analyzer in Big SQL is capable of combining multiple predicates to identify a full
row scan. Currently, this analysis happens at run time in the storage handler. At that point,
the decision of whether or not to use MapReduce has already been made. To bypass MapReduce,
a user currently has to provide an explicit local-mode access hint.
In the example below, the command “set force local on” makes sure that none of the queries
executed in the session use MapReduce.
BigSQL Shell

set force local on;

Issue the following select statement, which provides predicates for the columns that
comprise the full row key: O_CUSTKEY and O_ORDERKEY.
BigSQL Shell

select o_orderkey, o_totalprice from orders where o_custkey=4 and
o_orderkey=5612065;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5612065 |  71845.25781 |
+------------+--------------+
1 row in results(first row: 0.18s; total: 0.18s)

If you check the logs, you can see that Big SQL took both of the specified predicates
and combined them to do a row scan using all parts of the composite key.

BigSQL Log

…
… Found a row scan by combining all composite key parts.
… Found a row scan from row key parts
… HBase filter list created using AND.
… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):
[PrefixFilter \x01\x80\x00\x00\x04], …,
stopRow=\x01\x80\x00\x00\x04\x01\x80\x00\x00\x00\x00U\xA2!,
startRow=\x01\x80\x00\x00\x04\x01\x80\x00\x00\x00\x00U\xA2!, totalColumns=1, …}

Partial Row Scan
This section shows the capability of the Big SQL server to process predicates on the leading parts of
the row key – not necessarily the full row key, as in the previous section.
Issue the following example query, which provides a predicate for the first part of the row key,
O_CUSTKEY.
BigSQL Shell

select o_orderkey, o_totalprice from orders where o_custkey=4;

+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5453440 |  17938.41016 |
|    5612065 |  71845.25781 |
+------------+--------------+
2 rows in results(first row: 0.19s; total: 0.19s)

Checking the logs, you can see that the predicate on the first part of the row key is converted to a range
scan. The stop row in the scan is non-inclusive, so it is internally padded (with a trailing 0xFF byte
here) so that the scan covers the entire partial-key range.
BigSQL Log

…
… Found a row scan that uses the first 1 part(s) of composite key.
… Found a row scan from row key parts
… HBase filter list created using AND.
… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):
[PrefixFilter \x01\x80\x00\x00\x04], …, stopRow=\x01\x80\x00\x00\x04\xFF,
startRow=\x01\x80\x00\x00\x01, totalColumns=1, …}

Range Scan
When there are range predicates, we can set the start row, the stop row, or both.
In the example query below we have a ‘less than’ predicate; therefore we only know the stop
row. However, even setting this alone helps eliminate regions whose row keys fall above the
stop row (a sketch that sets both bounds follows at the end of this section). Issue the following command.
BigSQL Shell

select o_orderkey, o_totalprice from orders where o_custkey < 15;

+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5453440 |  17938.41016 |
|    5612065 |  71845.25781 |
|    5805349 | 255145.51562 |
|    5987111 |  97765.57812 |
|    5692738 | 143292.53125 |
|    5885190 | 125285.42969 |
|    5693440 | 117319.15625 |
|    5880160 | 198773.68750 |
|    5414466 | 149205.60938 |
|    5534435 | 136184.51562 |
|    5566567 |  56285.71094 |
+------------+--------------+
11 rows in results(first row: 0.22s; total: 0.22s)

Notice in the log file that, similarly to the previous section, we are only using the first part
of the composite key, since we are specifying O_CUSTKEY as the predicate. However, in this case,
since we only know the stop row (less than 15), there is no value for the start row portion of
the scan.
BigSQL Log

…
… Found a row scan that uses the first 1 part(s) of composite key.
… Found a row scan from row key parts
…
… HBase scan details:{…, families={cf=[d]}, …, stopRow=\x01\x80\x00\x00\x0F,
startRow=, totalColumns=1, …}
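If the query supplies both an upper and a lower bound on the leading key part, both the start and the stop row can be set. The following is a minimal sketch you can try against the same ORDERS table (row counts and timings will differ from the outputs shown above); the scan details in the log should then show values for both startRow and stopRow.
BigSQL Shell

-- Both bounds are on the leading part of the composite key (O_CUSTKEY),
-- so the scan can be limited on both ends.
select o_orderkey, o_totalprice from orders where o_custkey >= 4 and o_custkey < 15;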

Full Table Scan
This section simply shows an example of what happens when none of the predicates can be
pushed down to HBase.
In this example query, the predicate (on O_ORDERKEY) is on a non-leading part of the row key and
therefore is not pushed down. Issue the command and note that it results in a full table scan.
BigSQL Shell

select o_orderkey, o_totalprice from orders where o_orderkey=5612065;

+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|    5612065 |  71845.25781 |
+------------+--------------+
1 row in results(first row: 1.90s; total: 1.90s)

As can be determined by examining the logs, in cases where none of the predicates can be
pushed down to HBase, a full table scan is required, meaning there are no specified values for
either the start or the stop row.
BigSQL Log

…
… HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, …}

Automatic Index Usage
This section will demonstrate the benefits of an index lookup.
Before creating an index, let’s first execute a query that will invoke a full table scan so we
can do a comparison later and see the performance benefit of creating an
index on particular column(s). Notice that we are specifying a predicate on the O_CLERK column,
which is a middle (non-leading) part of the dense column we defined.
BigSQL Shell

SELECT * FROM orders WHERE o_clerk='Clerk#000000999'
go -m discard;
154 rows in results(first row: 2.40s; total: 4.32s)

As you can see below in the log file, there is no usage of an index.
BigSQL Log

…
… indexScanInfo: [isIndexScan: false], valuesInfo: [minValue: undefined,
minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo:
[numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false],
indexScanCandidateInfo: [hasIndexScanCandidate: false]]
…

Issue the following command to create an index on the O_CLERK column, which is a middle part
of a dense column in the table. This creates a new HBase table to store the index data. The index
table stores each column value and the row key it appears in.
BigSQL Shell

CREATE INDEX ix_clerk ON TABLE orders (o_clerk) AS 'hbase';

 Note:
The CREATE INDEX statement creates a new index table, named
<base_table_name>_<index_name>, deploys the coprocessor, and populates the index table
using a MapReduce index builder. The AS 'hbase' clause indicates the type of index handler to use; for
HBase, there is a separate index handler.

0 rows affected (total: 1m17.47s)

Re-issue the exact same command as we did earlier.
BigSQL Shell

SELECT * FROM orders WHERE o_clerk='Clerk#000000999'
go -m discard;

After creating the index and issuing the same select statement, Big SQL automatically
takes advantage of the index that was created and avoids a full table scan, which results in a
much faster response time.
154 rows in results(first row: 0.73s; total: 0.74s)

You can verify in the log file that Big SQL used the index. In this case the index table is scanned for all
rows whose keys start with the value of the O_CLERK predicate, Clerk#000000999 here. From
the matching row(s), the row key(s) of the base table are extracted, and get requests are batched
and sent to the data table.
BigSQL Log

…
… indexScanInfo: [isIndexScan: true, keyLookupType: point_query, indexDetails:
JaqlHBaseIndex[indexName: ix_clerk, indexSpec: {"bin_terminator": "#","columns":
[{"cf": "cf","col": "o_clerk","cq": "d","from_dense": "true"}],"comp_seperator":
"%","composite": "false","key_seperator": "/","name": "ix_clerk"}, numColumns: 1,
columns: [Ljava.lang.String;@3ced3ced, startValue: \x01Clerk#000000999\x00,
stopValue: \x01Clerk#000000999\x00]], valuesInfo: [minValue: [B@4b834b83,
minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo:
[numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false],
indexScanCandidateInfo: [hasIndexScanCandidate: true, indexScanCandidate:
IndexScanCandidate[columnName: o_clerk,indexColValue: [B@4cda4cda,[operator:
=,isVariableLength: false,type: null,encoding: BINARY]]]
… Found an index scan from index scan candidates. Details:
… Index name: ix_clerk
…
… Index query details: [indexSpec:ix_clerk, startValueBytes: #Clerk#000000999,
stopValueBytes: #Clerk#000000999,baseTableScanStart:,baseTableScanStop:]
… Index query successful.

 Note: For a composite index where multiple columns are used to define an index, predicates are handled
and pushed down similarly to what is done for composite row keys, as in the sketch below.
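As an illustration of the composite form, the hypothetical statement below defines an index over two parts of the cf:d dense column. The multi-column syntax is an assumption extrapolated from the single-column CREATE INDEX shown above; you do not need to run it for this lab, and if you do, expect another MapReduce index build of roughly the same duration as before.
BigSQL Shell

-- Hypothetical composite index over two parts of the dense column cf:d.
-- Building it launches another MapReduce index-builder job.
CREATE INDEX ix_clerk_status ON TABLE orders (o_clerk, o_orderstatus) AS 'hbase';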

Without an index, this predicate could not be pushed down, as it is on a non-leading part of
a dense column. In such cases a full table scan is required, as seen at the beginning of this
section.

Pushing Down Filters into HBase
Though HBase filters do not avoid a full table scan, they limit the rows and data returned to the
client. HBase filters have a skip facility which lets them skip over certain portions of data.
Many of the built-in filters implement this and thus prove more efficient than a raw table scan.
There are also filters that can limit the data within a row. For example, when a query only needs
the columns that make up the row key, filters like FirstKeyOnlyFilter and
KeyOnlyFilter can be applied so that only a single instance of the row key part of the data is
returned; a sketch of such a query follows.
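As a minimal sketch of a key-only query against the ORDERS table, the statement below projects only columns that are mapped into the composite row key. Whether Big SQL actually applies FirstKeyOnlyFilter or KeyOnlyFilter here depends on the release, so treat it as an experiment: run it and inspect the filter list in the HBase scan details of the log.
BigSQL Shell

-- Both projected columns come from the composite row key (O_CUSTKEY, O_ORDERKEY),
-- so no values from the cf:d or cf:od columns are needed to answer the query.
SELECT o_custkey, o_orderkey FROM orders WHERE o_custkey < 15;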
The sample query below will demonstrate a case where Big SQL pushes down a row scan
and a column filter.
BigSQL Shell

SELECT o_orderkey FROM orders WHERE o_custkey>100000 AND
o_orderstatus='P'
go -m discard;
1278 rows in results(first row: 0.37s; total: 0.38s)

Notice that the predicate on the O_CUSTKEY column triggers the row scan. The column filter,
SingleColumnValueFilter, is added because there is a predicate on the leading part
of a dense column (cf:d).
BigSQL Log

…
… Found a row scan that uses the first 1 part(s) of composite key.
… Found a row scan from row key parts
… HBase filter list created using AND.
…
… HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1):
[SingleColumnValueFilter (cf, d, EQUAL, \x01P\x00)], …, stopRow=,
startRow=\x01\x80\x01\x86\xA1, totalColumns=1, …}

This way Big SQL can automatically convert predicates into many of these filters and thus
handle queries more efficiently.

Table Access Hints
Access hints affect the strategy that is used to read the table, identify the source of the data,
and optimize a query. For example, the strategy can affect behaviour such as
whether MapReduce is employed to implement a join or whether an in-memory (hash) join is
employed. These hints can also control how to access data from specific sources. The table
access hint that we will explore here is accessmode.

Accessmode
The accessmode hint is very important for HBase. It avoids MapReduce overhead.
Combined with point queries, it ensures sub-second response times that are not affected
by the total data size.
There are multiple ways to specify the accessmode hint – as a query hint or at the session level. Note
that session-level hints take precedence: if “set force local off;” is run in a session,
all subsequent queries will always use MapReduce, even if an explicit accessmode='local'
hint is specified on the query. The optional experiment below illustrates this.
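The following is a sketch of that precedence rule that you can optionally try (timings will vary on your system): turn local mode off, run the hinted point query that is also used later in this section, and note that it goes through MapReduce anyway; then restore the earlier setting so the rest of the lab behaves as written.
BigSQL Shell

set force local off;

-- The session-level setting wins: despite the hint, this runs as a MapReduce job.
select o_orderkey from orders /*+ accessmode='local' +*/ where o_custkey=4 and o_orderkey=5612065;

-- Restore the session setting used earlier in this part of the lab.
set force local on;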
You can check the state of accessmode, if it was explicitly set, on the session with the
following command in the Big SQL shell.
BigSQL Shell

set;

If you kept the same shell open throughout this part of the lab, you will see the following
output. This is because we used “set force local on” earlier in one of the previous
sections.

+--------------------+-------+
| key                | value |
+--------------------+-------+
| bigsql.force.local | true  |
+--------------------+-------+
1 row in results(first row: 0.0s; total: 0.0s)

To change the setting back to the default, you can change the value to automatic with the
following command.
BigSQL Shell

set force local auto;

Issue the following select query.
BigSQL Shell

select o_orderkey from orders where o_custkey=4 and o_orderkey=5612065;

Notice how long the query takes.
+------------+
| o_orderkey |
+------------+
|    5612065 |
+------------+
1 row in results(first row: 7.2s; total: 7.2s)

Issue the same query with an accessmode hint this time.
BigSQL Shell

select o_orderkey from orders /*+ accessmode='local' +*/ where
o_custkey=4 and o_orderkey=5612065;

Notice how the query returns the results much faster this time. This is because of the local
accessmode; no MapReduce job is employed.
+------------+
| o_orderkey |
+------------+
|    5612065 |
+------------+
1 row in results(first row: 0.32s; total: 0.32s)

PART II – B – Connecting to Big SQL Server via JDBC
Organizations interested in Big SQL often have considerable SQL skills in-house, as well as
a suite of SQL-based business intelligence applications and query/reporting tools. The idea
of being able to leverage existing skills and tools — and perhaps reuse portions of existing
applications — can be quite appealing to organizations new to Hadoop.

Therefore Big SQL supports a JDBC driver that conforms to the JDBC 3.0 specification to
provide connectivity to Java™ applications. (Big SQL also supports a 32-bit or a 64-bit
ODBC driver, on either Linux or Windows, that conforms to the Microsoft Open Database
Connectivity 3.0.0 specification, to provide connectivity to C and C++ applications).
In this part of the lab, we will explore how to use Big SQL’s JDBC driver with BIRT, an open
source business intelligence and reporting tool that plugs into Eclipse. We will use this tool to
run some simple reports using SQL queries on data stored in HBase in our Hadoop
environment.

Business Intelligence and Reporting via BIRT
To start, open Eclipse from the Desktop of the virtual machine by clicking on the Eclipse icon.
When prompted to do so, leave the default workspace as is.
Once Eclipse has loaded, switch to the 'Report Design' perspective so that we can work with
BIRT. To do so, from the menu bar click on: Window -> Open Perspective -> Other....
Then click on: Report Design -> OK as shown below.

Once in the Report Design perspective, double-click on Orders.rptdesign from the
Navigator pane (on the bottom left-hand side) to open the pre-created report.

Note: A report has been created on your behalf to more quickly illustrate the functionality/usage of the Big SQL
drivers, while removing the tedious steps of designing a report in BIRT.

Expand 'Data Sets' in the Data Explorer. You will notice that the data sets (or report queries)
have a red 'X' beside them. This is because the pre-created report queries are not yet
associated with a data source. Now all that is necessary, prior to being able to run the report, is
to set up the JDBC connection to Big SQL.
To obtain the client drivers, open the BigInsights web console from the Desktop of the VM, or
point your browser to: http://bivm:8080. From the Welcome tab, in the Quick Links section,
select Download the Big SQL Client drivers.

Save the file to /home/biadmin/Desktop/IBD-1687A/.

42
Open the folder where you saved the file and extract the contents of the client package
under the same directory.

Back in Eclipse, add Big SQL as a source. Right-click on Data Sources -> New Data
Source from the Data Explorer pane on the top left-hand side. In the New Data Source
window, select JDBC Data Source and specify “Big SQL” for the Data Source Name. Click
Next.

In the New JDBC Data Source Profile window, click on Manage Drivers…. Once the
Manage JDBC Drivers window appears click on Add…

Point to the location where the client drivers were extracted, then click OK.
Once added, you should have an entry for the BigSQLDriver in the Driver Class
dropdown field list. Select it, and complete the fields with the following information:
• Database URL: jdbc:bigsql://localhost:7052
• User Name: biadmin
• Password: biadmin

Click on ‘Test Connection...’ to ensure we can connect to Big SQL using the JDBC driver.

Double-click 'Orders per year' and add the Big SQL connection that was just defined.

Examine the query:
WITH test
(order_year, order_date)
AS
(SELECT YEAR(o_orderdate), o_orderdate FROM orders FETCH FIRST 20 ROWS
ONLY)
SELECT order_year, COUNT(*) AS cnt FROM test GROUP BY order_year
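If you would like to verify the report data outside of BIRT, the same query can be run directly from the Big SQL shell (with a terminating semicolon added); the exact counts returned depend on which sample data set you loaded.
BigSQL Shell

-- Same query as the 'Orders per year' data set in the BIRT report.
WITH test (order_year, order_date) AS
  (SELECT YEAR(o_orderdate), o_orderdate FROM orders FETCH FIRST 20 ROWS ONLY)
SELECT order_year, COUNT(*) AS cnt FROM test GROUP BY order_year;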

Carry out the same procedure to add the Big SQL connection for the 'Top 5 salesmen'
data set and examine the query.
WITH base (o_clerk, tot) AS
(SELECT o_clerk, SUM(o_totalprice) AS tot FROM orders GROUP BY o_clerk
ORDER BY tot DESC)
SELECT o_clerk, tot FROM base FETCH FIRST 5 ROWS ONLY

 Note: Disregard the red ‘X’ that may still exist on the Data Sets. This is a bug and can safely be ignored.

Now that we have defined the Data Source and have the Data Sets configured, run the
report in Web Viewer as shown in the diagram below.

The output from the web viewer against the orders table on Big SQL should be as follows.

As seen in this part of the lab, a variety of IBM and non-IBM software that supports JDBC
and ODBC data sources can also be configured to work with Big SQL. We used BIRT here,
but as another example, Cognos Business Intelligence can use Big SQL's JDBC interface
to query data, generate reports, and perform other analytical functions. Similarly, other tools
like Tableau can leverage Big SQL’s ODBC drivers to work with data stored in a BigInsights
cluster.

Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more
  o Find the community that interests you …
    • Information Management: bit.ly/InfoMgmtCommunity
    • Business Analytics: bit.ly/AnalyticsCommunity
    • Enterprise Content Management: bit.ly/ECMCommunity
• IBM Champions
  o Recognizing individuals who have made the most outstanding contributions to Information Management, Business Analytics, and Enterprise Content Management communities
    • ibm.com/champion

Thank You!

Your Feedback is Important!
• Access the Conference Agenda Builder to complete your session surveys
  o Any web or mobile browser at http://iod13surveys.com/surveys.html
  o Any Agenda Builder kiosk onsite

Acknowledgements and Disclaimers:
Availability: References in this presentation to IBM products, programs, or services do not
imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and
reflect their own views. They are provided for informational purposes only, and are neither
intended to, nor shall have the effect of being, legal or other guidance or advice to any
participant. While efforts were made to verify the completeness and accuracy of the
information contained in this presentation, it is provided AS-IS without warranty of any kind,
express or implied. IBM shall not be responsible for any damages arising out of the use of, or
otherwise related to, this presentation or any other materials. Nothing contained in this
presentation is intended to, nor shall have the effect of, creating any warranties or
representations from IBM or its suppliers or licensors, or altering the terms and conditions of
the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have
used IBM products and the results they may have achieved. Actual environmental costs and
performance characteristics may vary by customer. Nothing contained in these materials is
intended to, nor shall have the effect of, stating or implying that any activities undertaken by
you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2013. All rights reserved.
• U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or
registered trademarks of International Business Machines Corporation in the
United States, other countries, or both. If these and other IBM trademarked terms
are marked on their first occurrence in this information with a trademark symbol
(® or ™), these symbols indicate U.S. registered or common law trademarks
owned by IBM at the time this information was published. Such trademarks may
also be registered or common law trademarks in other countries. A current list of
IBM trademarks is available on the Web at “Copyright and trademark
information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of
others.

49

Contenu connexe

Tendances

Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Nicolas Morales
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixNicolas Morales
 
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019Dave Stokes
 
Oracle 21c: New Features and Enhancements of Data Pump & TTS
Oracle 21c: New Features and Enhancements of Data Pump & TTSOracle 21c: New Features and Enhancements of Data Pump & TTS
Oracle 21c: New Features and Enhancements of Data Pump & TTSChristian Gohmann
 
Hbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBaseHbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBasephanleson
 
Build Application With MongoDB
Build Application With MongoDBBuild Application With MongoDB
Build Application With MongoDBEdureka!
 
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San FranciscoMySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San FranciscoDave Stokes
 
How HarperDB Works
How HarperDB WorksHow HarperDB Works
How HarperDB WorksHarperDB
 
HBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - OperationsHBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - Operationsphanleson
 
MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019
MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019
MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019Dave Stokes
 
Oracle Database 12c - New Features for Developers and DBAs
Oracle Database 12c  - New Features for Developers and DBAsOracle Database 12c  - New Features for Developers and DBAs
Oracle Database 12c - New Features for Developers and DBAsAlex Zaballa
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designphanleson
 
Oracle 12.2 sharded database management
Oracle 12.2 sharded database managementOracle 12.2 sharded database management
Oracle 12.2 sharded database managementLeyi (Kamus) Zhang
 
Sql portfolio admin_practicals
Sql portfolio admin_practicalsSql portfolio admin_practicals
Sql portfolio admin_practicalsShelli Ciaschini
 
Oracle Data Guard Broker Webinar
Oracle Data Guard Broker WebinarOracle Data Guard Broker Webinar
Oracle Data Guard Broker WebinarZohar Elkayam
 
MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018Dave Stokes
 
Architecture of exadata database machine – Part II
Architecture of exadata database machine – Part IIArchitecture of exadata database machine – Part II
Architecture of exadata database machine – Part IIParesh Nayak,OCP®,Prince2®
 

Tendances (20)

Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with Bluemix
 
hbase lab
hbase labhbase lab
hbase lab
 
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019
 
Oracle 21c: New Features and Enhancements of Data Pump & TTS
Oracle 21c: New Features and Enhancements of Data Pump & TTSOracle 21c: New Features and Enhancements of Data Pump & TTS
Oracle 21c: New Features and Enhancements of Data Pump & TTS
 
Hbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBaseHbase in action - Chapter 09: Deploying HBase
Hbase in action - Chapter 09: Deploying HBase
 
Build Application With MongoDB
Build Application With MongoDBBuild Application With MongoDB
Build Application With MongoDB
 
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San FranciscoMySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
 
How HarperDB Works
How HarperDB WorksHow HarperDB Works
How HarperDB Works
 
HBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - OperationsHBase In Action - Chapter 10 - Operations
HBase In Action - Chapter 10 - Operations
 
MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019
MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019
MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019
 
Oracle Database 12c - New Features for Developers and DBAs
Oracle Database 12c  - New Features for Developers and DBAsOracle Database 12c  - New Features for Developers and DBAs
Oracle Database 12c - New Features for Developers and DBAs
 
HBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table designHBase In Action - Chapter 04: HBase table design
HBase In Action - Chapter 04: HBase table design
 
Oracle 12.2 sharded database management
Oracle 12.2 sharded database managementOracle 12.2 sharded database management
Oracle 12.2 sharded database management
 
Sql portfolio admin_practicals
Sql portfolio admin_practicalsSql portfolio admin_practicals
Sql portfolio admin_practicals
 
Oracle Data Guard Broker Webinar
Oracle Data Guard Broker WebinarOracle Data Guard Broker Webinar
Oracle Data Guard Broker Webinar
 
MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018
 
Oracle Complete Interview Questions
Oracle Complete Interview QuestionsOracle Complete Interview Questions
Oracle Complete Interview Questions
 
Architecture of exadata database machine – Part II
Architecture of exadata database machine – Part IIArchitecture of exadata database machine – Part II
Architecture of exadata database machine – Part II
 
Presentation day4 oracle12c
Presentation day4 oracle12cPresentation day4 oracle12c
Presentation day4 oracle12c
 

En vedette

HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars GeorgeJAX London
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignCloudera, Inc.
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance ImprovementBiju Nair
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101Nick Dimiduk
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance TuningLars Hofhansl
 

En vedette (6)

HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema Design
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 

Similaire à Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL

Big Data: Big SQL web tooling (Data Server Manager) self-study lab
Big Data:  Big SQL web tooling (Data Server Manager) self-study labBig Data:  Big SQL web tooling (Data Server Manager) self-study lab
Big Data: Big SQL web tooling (Data Server Manager) self-study labCynthia Saracco
 
Big Data: Explore Hadoop and BigInsights self-study lab
Big Data:  Explore Hadoop and BigInsights self-study labBig Data:  Explore Hadoop and BigInsights self-study lab
Big Data: Explore Hadoop and BigInsights self-study labCynthia Saracco
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...Leons Petražickis
 
Informatica object migration
Informatica object migrationInformatica object migration
Informatica object migrationAmit Sharma
 
DBA, LEVEL III TTLM Monitoring and Administering Database.docx
DBA, LEVEL III TTLM Monitoring and Administering Database.docxDBA, LEVEL III TTLM Monitoring and Administering Database.docx
DBA, LEVEL III TTLM Monitoring and Administering Database.docxseifusisay06
 
Drupal Continuous Integration with Jenkins - Deploy
Drupal Continuous Integration with Jenkins - DeployDrupal Continuous Integration with Jenkins - Deploy
Drupal Continuous Integration with Jenkins - DeployJohn Smith
 
PostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingPostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingAmir Reza Hashemi
 
HPE NonStop SQL WebDBS - Introduction
HPE NonStop SQL WebDBS - IntroductionHPE NonStop SQL WebDBS - Introduction
HPE NonStop SQL WebDBS - IntroductionFrans Jongma
 
Percona Cluster with Master_Slave for Disaster Recovery
Percona Cluster with Master_Slave for Disaster RecoveryPercona Cluster with Master_Slave for Disaster Recovery
Percona Cluster with Master_Slave for Disaster RecoveryRam Gautam
 
How to install Vertica in a single node.
How to install Vertica in a single node.How to install Vertica in a single node.
How to install Vertica in a single node.Anil Maharjan
 
Windows logging cheat sheet
Windows logging cheat sheetWindows logging cheat sheet
Windows logging cheat sheetMichael Gough
 
WebSphere Portal Version 6.0 Web Content Management and DB2 Tuning Guide
WebSphere Portal Version 6.0 Web Content Management and DB2 Tuning GuideWebSphere Portal Version 6.0 Web Content Management and DB2 Tuning Guide
WebSphere Portal Version 6.0 Web Content Management and DB2 Tuning GuideTan Nguyen Phi
 

Similaire à Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL (20)

Big Data: Big SQL web tooling (Data Server Manager) self-study lab
Big Data:  Big SQL web tooling (Data Server Manager) self-study labBig Data:  Big SQL web tooling (Data Server Manager) self-study lab
Big Data: Big SQL web tooling (Data Server Manager) self-study lab
 
Big Data: Explore Hadoop and BigInsights self-study lab
Big Data:  Explore Hadoop and BigInsights self-study labBig Data:  Explore Hadoop and BigInsights self-study lab
Big Data: Explore Hadoop and BigInsights self-study lab
 
Mysql ppt
Mysql pptMysql ppt
Mysql ppt
 
IUG ATL PC 9.5
IUG ATL PC 9.5IUG ATL PC 9.5
IUG ATL PC 9.5
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop lab s...
 
Informatica object migration
Informatica object migrationInformatica object migration
Informatica object migration
 
Performance Tuning
Performance TuningPerformance Tuning
Performance Tuning
 
DBA, LEVEL III TTLM Monitoring and Administering Database.docx
DBA, LEVEL III TTLM Monitoring and Administering Database.docxDBA, LEVEL III TTLM Monitoring and Administering Database.docx
DBA, LEVEL III TTLM Monitoring and Administering Database.docx
 
instaling
instalinginstaling
instaling
 
instaling
instalinginstaling
instaling
 
instaling
instalinginstaling
instaling
 
instaling
instalinginstaling
instaling
 
Drupal Continuous Integration with Jenkins - Deploy
Drupal Continuous Integration with Jenkins - DeployDrupal Continuous Integration with Jenkins - Deploy
Drupal Continuous Integration with Jenkins - Deploy
 
PostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / ShardingPostgreSQL Table Partitioning / Sharding
PostgreSQL Table Partitioning / Sharding
 
HPE NonStop SQL WebDBS - Introduction
HPE NonStop SQL WebDBS - IntroductionHPE NonStop SQL WebDBS - Introduction
HPE NonStop SQL WebDBS - Introduction
 
Percona Cluster with Master_Slave for Disaster Recovery
Percona Cluster with Master_Slave for Disaster RecoveryPercona Cluster with Master_Slave for Disaster Recovery
Percona Cluster with Master_Slave for Disaster Recovery
 
How to install Vertica in a single node.
How to install Vertica in a single node.How to install Vertica in a single node.
How to install Vertica in a single node.
 
Windows logging cheat sheet
Windows logging cheat sheetWindows logging cheat sheet
Windows logging cheat sheet
 
Db2 tutorial
Db2 tutorialDb2 tutorial
Db2 tutorial
 
WebSphere Portal Version 6.0 Web Content Management and DB2 Tuning Guide
WebSphere Portal Version 6.0 Web Content Management and DB2 Tuning GuideWebSphere Portal Version 6.0 Web Content Management and DB2 Tuning Guide
WebSphere Portal Version 6.0 Web Content Management and DB2 Tuning Guide
 

Dernier

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Dernier (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL

  • 1. Hands on Lab Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL Session Number 1687 Piotr Pruski, IBM, piotr.pruski@ca.ibm.com ( Benjamin Leonhardi, IBM @ppruski) 1
  • 2. Table of Contents Lab Setup ............................................................................................................................ 3 Getting Started .................................................................................................................... 3 Administering the Big SQL and HBase Servers................................................................. 4 Part I – Creating Big SQL Tables and Loading Data ......................................................... 6 Background ..................................................................................................................... 6 One-to-one Mapping....................................................................................................... 9 Adding New JDBC Drivers ...................................................................................... 11 One-to-one Mapping with UNIQUE Clause................................................................. 13 Many-to-one Mapping (Composite Keys and Dense Columns)................................... 16 Why do we need many-to-one mapping? ................................................................. 17 Data Collation Problem............................................................................................. 19 Many-to-one Mapping with Binary Encoding.............................................................. 20 Many-to-one Mapping with HBase Pre-created Regions and External Tables ............ 22 Load Data: Error Handling ........................................................................................... 26 [OPTIONAL] HBase Access via JAQL ....................................................................... 27 PART II – A – Query Handling........................................................................................ 31 The Data........................................................................................................................ 31 Projection Pushdown .................................................................................................... 33 Predicate Pushdown ...................................................................................................... 34 Point Scan ................................................................................................................. 34 Partial Row Scan....................................................................................................... 35 Range Scan................................................................................................................ 35 Full Table Scan ......................................................................................................... 36 Automatic Index Usage................................................................................................. 37 Pushing Down Filters into HBase................................................................................. 38 Table Access Hints ....................................................................................................... 39 Accessmode .............................................................................................................. 39 PART II – B – Connecting to Big SQL Server via JDBC ................................................ 40 Business Intelligence and Reporting via BIRT............................................................. 
41 Communities ..................................................................................................................... 48 Thank You! ....................................................................................................................... 48 Acknowledgements and Disclaimers................................................................................ 49 2
  • 3. Lab Setup This lab exercise uses the IBM InfoSphere BigInsights Quick Start Edition, v2.1. The Quick Start Edition uses a non-warranted program license, and is not for production use. The purpose of the Quick Start Edition is for experimenting with the features of InfoSphere BigInsights, while being able to use real data and run real applications. The Quick Start Edition puts no data limit on the cluster and there is no time limit on the license. The following table outlines the users and passwords that are pre-configured on the image: username root biadmin db2inst1 password password biadmin password Getting Started To prepare for the contents of this lab, you must go through the following process to start all of the Hadoop components. 1. Start the VMware image by clicking the “Power on this virtual machine” button in VMware Workstation if the VM is not already on. 2. Log into the VMware virtual machine using the following information user: biadmin password: biadmin 3. Double-click on the BigInsights Shell folder icon from the desktop of the Quick Start VM. This view provides you with quick links to access the following functions that will be used throughout the course of this exercise: Big SQL Shell HBase Shell Jaql Shell Linux gnome-terminal 3
  • 4. 4. Open the Terminal (gnome-terminal) and start the Hadoop components (daemons). Linux Terminal start-all.sh Note: This command may take a few minutes to finish. Once all components have started successfully as shown below you may move to the next section. … [INFO] Progress - 100% [INFO] DeployManager - Start; SUCCEEDED components: [zookeeper, hadoop, derby, hive, hbase, bigsql, oozie, orchestrator, console, httpfs]; Consumes : 174625ms Administering the Big SQL and HBase Servers BigInsights provides both command-line tools and a user interface to manage the Big SQL and HBase servers. In this section, we will briefly go over the user interface which is part of BigInsights Web Console. 1. Bring up the BigInsights web console by double clicking on the BigInsights WebConsole icon on the desktop of the VM and open the Cluster Status tab. Select HBase to view the status of HBase master and region servers. 2. Similarly, click on Big SQL from the same tab to view its status. 4
  • 5. 3. Use hbase-master and hbase-regionserver web interfaces to visualize tables, regions and other metrics. Go to the BigInsights Welcome tab and select “Access Secure Cluster Servers.” You may need to enable pop-ups from the site when prompted. Alternatively, point browser to the following bottom two URL’s noted in the image below. Some interesting information from the web interfaces are: HBase root directory • This can be used to find the size of an HBase table. List of tables with descriptions. 5
  • 6. Each table displays lists of regions with start and end keys. • This information can be used to compact or split tables as needed. Metrics for each region server. • These can be used to determine if there are hot regions which are serving the majority of requests to a table. Such regions can be split. It also helps determine the effects and effectiveness of block cache, bloom filters and memory settings. 4. Perform a health check of HBase and Big SQL which is different from the status checks done above. It verifies the health of the functionality. From the Linux gnome-terminal, issue the following commands. Linux Terminal $BIGINSIGHTS_HOME/bin/healthcheck.sh hbase [INFO] DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ] [INFO] Progress - Health check hbase [INFO] Deployer - Try to start hbase if hbase service is stopped... [INFO] Deployer - Double check whether hbase is started successfully... [INFO] @bivm - hbase-master(active) started, pid 6627 [INFO] @bivm - hbase-regionserver started, pid 6745 [INFO] Deployer - hbase service started [INFO] Deployer - hbase service is healthy [INFO] Progress - 100% [INFO] DeployManager - Health check; SUCCEEDED components: [hbase]; Consumes : 26335ms Linux Terminal $BIGINSIGHTS_HOME/bin/healthcheck.sh bigsql [INFO] [INFO] [INFO] [INFO] [INFO] [INFO] [INFO] 1121ms DeployCmdline - [ IBM InfoSphere BigInsights QuickStart Edition ] Progress - Health check bigsql @bivm - bigsql-server already running, pid 6949 Deployer - Ping Check Success: bivm/192.168.230.137:7052 @bivm - bigsql is healthy Progress - 100% DeployManager - Health check; SUCCEEDED components: [bigsql]; Consumes : Part I – Creating Big SQL Tables and Loading Data In this part of the lab, our main goal is to demonstrate a migration of a table from a relational database to Big Insights using Big SQL over HBase. We will understand how HBase handles row keys and some pitfalls that users may encounter when moving data from a relational database to HBase tables. We will also try some useful options like pre-creating regions to see how it can help with data loading and queries. We will also explore various ways to load data. Background 6
  • 7. In this lab, we will use one table from the Great Outdoors Sales Data Warehouse model (GOSALESDW), SLS_SALES_FACT. The details of the tables along with its primary key information are depicted in the figure below. SLS_SALES_FACT PK PK PK PK PK PK PK ORDER_DAY_KEY ORGANIZATION_KEY EMPLOYEE_KEY RETAILER_KEY RETAILER_SITE_KEY PROMOTION_KEY ORDER_METHOD_KEY SALES_ORDER_KEY SHIP_DAY_KEY CLOSE_DAY_KEY QUANTITY UNIT_COST UNIT_PRICE UNIT_SALE_PRICE GROSS_MARGIN SALE_TOTAL GROSS_PROFIT There is an instance of DB2 contained on this image which contains this table with data already loaded that we will use in our migration. From the Linux gnome-terminal, switch to the DB2 instance user as shown below. Linux Terminal su - db2inst1 Note: The password for the db2inst1 is password. Enter this when prompted. As db2inst1, connect to the pre-created database, gosales. Linux Terminal db2 CONNECT TO gosales Upon successful connection, you should see the following output on the terminal. Database Connection Information Database server SQL authorization ID Local database alias = DB2/LINUXX8664 10.5.0 = DB2INST1 = GOSALES Issue the following command to list all of the tables contained in this database. 7
  • 8. Linux Terminal db2 LIST TABLES Note: Here you will see three tables. Each one is essentially the same except with one key difference – the amount of data that is contained within them. The remaining instructions in this lab exercise will use the SLS_SALES_FACT_10P table simply for the fact that it has a smaller amount of data and will be faster to work with for demonstration purposes. If you would like to use the larger tables with more data feel free to do so but just remember to change the names appropriately. Table/View ------------------------------SLS_SALES_FACT SLS_SALES_FACT_10P SLS_SALES_FACT_25P Schema --------------DB2INST1 DB2INST1 DB2INST1 Type ----T T T Creation time -------------------------2013-08-22-14.51.27.228148 2013-08-22-14.54.01.622569 2013-08-22-14.55.46.416787 3 record(s) selected. Examine how many rows we have in this table to ensure later everything will be migrated properly. Issue the following select statement. Linux Terminal db2 "SELECT COUNT(*) FROM sls_sales_fact_10p" You should expect 44603 rows in this table. 1 ----------44603 1 record(s) selected. Use the following describe command to view all of the columns and data types that are contained within this table. Linux Terminal db2 "DESCRIBE TABLE sls_sales_fact_10p" 8
• 9.
                                Data type                     Column
Column name                     schema    Data type name      Length     Scale Nulls
------------------------------- --------- ------------------- ---------- ----- ------
ORDER_DAY_KEY                   SYSIBM    INTEGER                      4     0 Yes
ORGANIZATION_KEY                SYSIBM    INTEGER                      4     0 Yes
EMPLOYEE_KEY                    SYSIBM    INTEGER                      4     0 Yes
RETAILER_KEY                    SYSIBM    INTEGER                      4     0 Yes
RETAILER_SITE_KEY               SYSIBM    INTEGER                      4     0 Yes
PRODUCT_KEY                     SYSIBM    INTEGER                      4     0 Yes
PROMOTION_KEY                   SYSIBM    INTEGER                      4     0 Yes
ORDER_METHOD_KEY                SYSIBM    INTEGER                      4     0 Yes
SALES_ORDER_KEY                 SYSIBM    INTEGER                      4     0 Yes
SHIP_DAY_KEY                    SYSIBM    INTEGER                      4     0 Yes
CLOSE_DAY_KEY                   SYSIBM    INTEGER                      4     0 Yes
QUANTITY                        SYSIBM    INTEGER                      4     0 Yes
UNIT_COST                       SYSIBM    DECIMAL                     19     2 Yes
UNIT_PRICE                      SYSIBM    DECIMAL                     19     2 Yes
UNIT_SALE_PRICE                 SYSIBM    DECIMAL                     19     2 Yes
GROSS_MARGIN                    SYSIBM    DOUBLE                       8     0 Yes
SALE_TOTAL                      SYSIBM    DECIMAL                     19     2 Yes
GROSS_PROFIT                    SYSIBM    DECIMAL                     19     2 Yes

  18 record(s) selected.
One-to-one Mapping
In this section, we will use Big SQL to do a one-to-one mapping of the columns in the relational DB2 table to an HBase table row key and columns. This is not a recommended approach; however, the goal of this exercise is to demonstrate the inefficiency and pitfalls that can occur with such a mapping.
Big SQL supports both one-to-one and many-to-one mappings. In a one-to-one mapping, the HBase row key and each HBase column are each mapped to a single SQL column. For example, the HBase row key might be mapped to a SQL column id, a cq_name column within a cf_data column family might be mapped to a SQL column name, and so on.
To begin, first create a schema to keep our tables organized. Open the BigSQL Shell from the BigInsights Shell folder on the desktop and use the create schema command to create a schema named gosalesdw.
BigSQL Shell
CREATE SCHEMA gosalesdw; 9
  • 10. Issue the following command in the same BigSQL shell that is open. This DDL statement will create the SQL table with the one-to-one mapping of what we have in our relational DB2 source. Notice all the column names are the same with the same data types. The column mapping section requires a mapping for the row key. HBase columns are identified using family:qualifier. BigSQL Shell CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT ( ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int, UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2) ) COLUMN MAPPING ( key mapped by (ORDER_DAY_KEY), cf_data:cq_ORGANIZATION_KEY mapped by (ORGANIZATION_KEY), cf_data:cq_EMPLOYEE_KEY mapped by (EMPLOYEE_KEY), cf_data:cq_RETAILER_KEY mapped by (RETAILER_KEY), cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY), cf_data:cq_PRODUCT_KEY mapped by (PRODUCT_KEY), cf_data:cq_PROMOTION_KEY mapped by (PROMOTION_KEY), cf_data:cq_ORDER_METHOD_KEY mapped by (ORDER_METHOD_KEY), cf_data:cq_SALES_ORDER_KEY mapped by (SALES_ORDER_KEY), cf_data:cq_SHIP_DAY_KEY mapped by (SHIP_DAY_KEY), cf_data:cq_CLOSE_DAY_KEY mapped by (CLOSE_DAY_KEY), cf_data:cq_QUANTITY mapped by (QUANTITY), cf_data:cq_UNIT_COST mapped by (UNIT_COST), cf_data:cq_UNIT_PRICE mapped by (UNIT_PRICE), cf_data:cq_UNIT_SALE_PRICE mapped by (UNIT_SALE_PRICE), cf_data:cq_GROSS_MARGIN mapped by (GROSS_MARGIN), cf_data:cq_SALE_TOTAL mapped by (SALE_TOTAL), cf_data:cq_GROSS_PROFIT mapped by (GROSS_PROFIT) ); Big SQL supports a load from source command that can be used to load data from warehouse sources which we’ll use first. It also supports loading data from delimited files using a load hbase command which we will use later. 10
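Before adding the JDBC driver and loading any data, you can optionally confirm from the HBase shell that the CREATE HBASE TABLE statement created a corresponding HBase table. Open the HBase Shell from the BigInsights Shell folder on the desktop and issue the list command used elsewhere in this lab; the exact output depends on what already exists on your cluster, but you should see an entry named gosalesdw.sls_sales_fact. This is only a sanity check and can be skipped.
HBase Shell
list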
• 11. Adding New JDBC Drivers
The load from source command uses Sqoop internally to do the load. Therefore, before using the load command from a BigSQL shell, we first need to add the driver for the JDBC source into 1) the Sqoop library directory, and 2) the JSQSH terminal shared directory.
From a Linux gnome-terminal, issue the following command (as biadmin) to copy the JDBC driver JAR file used to access the database into the $SQOOP_HOME/lib directory.
Linux Terminal
cp /opt/ibm/db2/V10.5/java/db2jcc.jar $SQOOP_HOME/lib
From the BigSQL shell, examine the drivers currently loaded for the JSQSH terminal.
BigSQL Shell
drivers
Terminate the BigSQL shell with the quit command.
BigSQL Shell
quit
Copy the same DB2 driver to the JSQSH share directory with the following command.
Linux Terminal
cp /opt/ibm/db2/V10.5/java/db2jcc.jar $BIGINSIGHTS_HOME/bigsql/jsqsh/share/
When a user adds new drivers, the Big SQL server must be restarted. You can do this either from the web console, or with the following command from the Linux gnome-terminal.
Linux Terminal
stop.sh bigsql && start.sh bigsql
Open the BigSQL Shell from the BigInsights Shell folder on the desktop once again, since it was closed in our earlier step with the quit command, and check that the driver was in fact loaded into JSQSH.
BigSQL Shell
drivers
Now that the drivers have been set, the load can finally take place. The load from source statement extracts data from a source outside of an InfoSphere BigInsights cluster (DB2 in this case) and loads that data into an InfoSphere BigInsights HBase (or Hive) table. Issue the following command to load the SLS_SALES_FACT_10P table from DB2 into the SLS_SALES_FACT table we have defined in BigSQL.
BigSQL Shell 11
• 12. LOAD USING JDBC CONNECTION URL 'jdbc:db2://localhost:50000/GOSALES' WITH PARAMETERS (user = 'db2inst1',password = 'password') FROM TABLE SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY INTO HBASE TABLE gosalesdw.sls_sales_fact APPEND;
You should expect to load 44603 rows, which is the same number of rows that the select count statement on the original DB2 table verified earlier.
44603 rows affected (total: 1m37.74s)
Try to verify this with a select count statement as shown.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact;
Notice there is a discrepancy between the results from the load operation and the select count statement.
+----+
|    |
+----+
| 33 |
+----+
1 row in results(first row: 3.13s; total: 3.13s)
Also verify from an HBase shell. Open the HBase Shell from the BigInsights Shell folder on the desktop and issue the following count command to verify the number of rows.
HBase Shell
count 'gosalesdw.sls_sales_fact'
It should be apparent that the results from the Big SQL statement and the HBase command agree with one another.
33 row(s) in 0.7000 seconds
However, this doesn't yet explain why there is a mismatch between the number of loaded rows and the number of retrieved rows when we query the table. The load (and insert, to be examined later) command behaves like an upsert: if a row with the same row key already exists, HBase writes the new value as a new version of that column/cell. When querying the table, only the latest version is returned by Big SQL.
This behaviour can be confusing. In our case, we loaded data with repeating row key values from a DB2 table with 44603 rows, and the load reported 44603 rows affected. However, the select count(*) showed far fewer rows: 33, to be exact. No errors are thrown in such scenarios, so it is always recommended to cross-check the number of rows by querying the table, as we did.
Now that we understand that all the rows are actually versioned in HBase, we can examine a possible way to retrieve all versions of a particular row. 12
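As an optional aside, this versioning behaviour is possible because the underlying HBase table retains old cell versions. You can inspect the table's settings from the HBase shell with the describe command sketched below; look for the VERSIONS attribute of the cf_data column family. Treat this as a hedged example rather than expected output: the exact attributes and values you see depend on the BigInsights/HBase release on the image.
HBase Shell
describe 'gosalesdw.sls_sales_fact'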
• 13. First, from the BigSQL shell, issue the following select query with a predicate on the order day key. In the original table, there are most likely many tuples with the same order day key.
BigSQL Shell
SELECT organization_key FROM gosalesdw.sls_sales_fact WHERE order_day_key = 20070720;
As expected, we retrieve only one row, which is the latest or newest version of the row inserted into HBase with the specified order day key.
+------------------+
| organization_key |
+------------------+
| 11171            |
+------------------+
Using the HBase shell, we can retrieve previous versions for a row key. Use the following get command to get the top 4 versions of the row with row key 20070720.
HBase Shell
get 'gosalesdw.sls_sales_fact', '20070720', {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
Since the previous command specified only 4 versions (VERSIONS => 4), we only retrieve 4 rows in the output.
COLUMN                        CELL
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546430, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546429, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546428, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546427, value=11171
4 row(s) in 0.0360 seconds
Optionally, try the same command again specifying a larger version number, for example VERSIONS => 100. Either way, this is most likely not the behaviour users expect when performing such a migration: they probably wanted to get all the data into the HBase table without versioned cells. There are a couple of solutions for this. One is to define the table with a composite row key to enforce uniqueness, which will be explored later in this lab. Another option, outlined in the next section, is to force each row key to be unique by appending a UUID.
One-to-one Mapping with UNIQUE Clause 13
  • 14. Another option while performing such a migration is to use the force key unique option when creating the table using BigSQL syntax. This option will force the load to add a UUID to the row key. It helps to prevent versioning of cells. However, this method is quite inefficient as it stores more data and also makes queries slower. Issue the following command in the BigSQL shell. This statement will create the SQL table with the one-to-one mapping of what we have in our relational DB2 source. This DDL statement is almost identical to what was seen in the previous section with one exception: the force key unique clause is specified for the column mapping of the row key. BigSQL Shell CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_UNIQUE ( ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int, UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2) ) COLUMN MAPPING ( key mapped by (ORDER_DAY_KEY) force key unique, cf_data:cq_ORGANIZATION_KEY mapped by (ORGANIZATION_KEY), cf_data:cq_EMPLOYEE_KEY mapped by (EMPLOYEE_KEY), cf_data:cq_RETAILER_KEY mapped by (RETAILER_KEY), cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY), cf_data:cq_PRODUCT_KEY mapped by (PRODUCT_KEY), cf_data:cq_PROMOTION_KEY mapped by (PROMOTION_KEY), cf_data:cq_ORDER_METHOD_KEY mapped by (ORDER_METHOD_KEY), cf_data:cq_SALES_ORDER_KEY mapped by (SALES_ORDER_KEY), cf_data:cq_SHIP_DAY_KEY mapped by (SHIP_DAY_KEY), cf_data:cq_CLOSE_DAY_KEY mapped by (CLOSE_DAY_KEY), cf_data:cq_QUANTITY mapped by (QUANTITY), cf_data:cq_UNIT_COST mapped by (UNIT_COST), cf_data:cq_UNIT_PRICE mapped by (UNIT_PRICE), cf_data:cq_UNIT_SALE_PRICE mapped by (UNIT_SALE_PRICE), cf_data:cq_GROSS_MARGIN mapped by (GROSS_MARGIN), cf_data:cq_SALE_TOTAL mapped by (SALE_TOTAL), cf_data:cq_GROSS_PROFIT mapped by (GROSS_PROFIT) ); 14
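After you load data into this table in the next step, you can optionally peek at one stored row key from the HBase shell to see the effect of the force key unique clause: each physical key is the ORDER_DAY_KEY value with a generated unique suffix appended. A minimal sketch is shown below; the LIMIT option simply returns the first row found, and the actual key value will differ on your system.
HBase Shell
scan 'gosalesdw.sls_sales_fact_unique', {LIMIT => 1}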
• 15. In the previous section, we used the load from source command to get the data from our table on the DB2 source into HBase. This may not always be feasible, which is why in this section we explore another loading statement, load hbase. This will load data into HBase from flat files, which are perhaps an export of the data from the relational source.
Issue the following statement, which will load data from a file into an InfoSphere BigInsights HBase table.
BigSQL Shell
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt' DELIMITED FIELDS TERMINATED BY 't' INTO TABLE gosalesdw.sls_sales_fact_unique;
 Note: The load hbase command can take in an optional list of columns. If no column list is specified, it will use the column ordering in the table definition. The input file can be on DFS or on the local file system where the Big SQL server is running.
Once again, you should expect to load 44603 rows, which is the same number of rows that the select count statement on the original DB2 table verified.
44603 rows affected (total: 26.95s)
Verify the number of rows loaded with a select count statement as shown.
BigSQL Shell
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;
This time there is no discrepancy between the results from the load operation and the select count statement.
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 1.61s; total: 1.61s)
Issue the same count from the HBase shell to be sure.
HBase Shell
count 'gosalesdw.sls_sales_fact_unique'
The values are consistent across the load, select, and count.
...
44603 row(s) in 6.8490 seconds
As in the previous section, from the BigSQL shell, issue the following select query with a predicate on the order day key.
BigSQL Shell 15
  • 16. SELECT organization_key FROM gosalesdw.sls_sales_fact_unique WHERE order_day_key = 20070720; In the previous section, only one row was returned for the specified date. This time, expect to see 1405 rows since the rows are now forced to be unique due to our clause in the create statement and therefore no versioning should be applied. 1405 rows in results(first row: 0.47s; total: 0.58s) Once again, as in the previous section, we can check from the HBase shell if there are multiple versions of the cells. Issue the following get statement to attempt to retrieve the top 4 versions of the row with row key 20070720. HBase Shell get 'gosalesdw.sls_sales_fact_unique', '20070720', {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4} Zero rows are returned because the row key of 20070720 doesn’t exist. This is due to the fact we have appended the UUID to each row key; (20070720 + UUID). COLUMN 0 row(s) in 0.0850 seconds CELL Therefore, instead, issue the follow HBase command to do a scan versus a get. This will scan the table using the first part of the row key. We are also indicating scanner specifications of start and stop row values to only return the results we are interested in retrieving. HBase Shell scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '20070720', STOPROW => '20070721'} Notice there are no discrepancies between the results from Big SQL select and HBase scan. 1405 row(s) in 12.1350 seconds Many-to-one Mapping (Composite Keys and Dense Columns) This section is dedicated to the other option of trying to enforce uniqueness of the cells and that is to define a table with a composite row key (aka many-to-one mapping). In a many-to-one mapping, multiple SQL columns are mapped to a single HBase entity (row key or a column). There are two terms that may be used frequently: composite key and dense column. A composite key is an HBase row key that is mapped to multiple SQL columns. A dense column is an HBase column that is mapped to multiple SQL columns. In the following example, the row key contains two parts – userid and account number. Each part corresponds to a SQL column. Similarly, the HBase columns are mapped to multiple 16
• 17. SQL columns. Note that we can have a mix. For example, we can have a composite key, a dense column and a non-dense column, or any combination of these.
[Figure: an example of many-to-one mapping for a banking table. The HBase row key 11111_ac11 is a composite of the SQL columns userid and acc_no. In the cf_data column family, the cq_names column packs first_name and last_name (for example fname1_lname1), and the cq_acct column packs balance, min_bal and interest (for example 11111#11#0.25). The top of the figure shows the HBase layout, the bottom the corresponding SQL columns. A short illustrative DDL sketch of this style of mapping appears at the end of this page, after the storage discussion.]
Issue the following DDL statement from the BigSQL shell, which represents all entities from our relational table using a many-to-one mapping. Take notice of the column mapping section, where multiple columns can be mapped to a single family:qualifier.
BigSQL Shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE
( ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int, UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2) )
COLUMN MAPPING
(
key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY),
cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
cf_data:cq_QUANTITY mapped by (QUANTITY),
cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
);
Why do we need many-to-one mapping?
HBase stores a lot of information for each value. For each value stored, a key consisting of the row key, column family name, column qualifier and timestamp is also stored. This means a lot of duplicate information is kept. HBase is very verbose and it is primarily intended for sparse data. In most cases, data in the relational world is not sparse. If we were to store each SQL column individually in HBase, as in our previous two sections, the required storage space would grow dramatically. When querying that data back, the query also returns the entire key (meaning the row key, column family, and column qualifier) for each value. As an example, after loading data into this table we will examine the storage space for each of the three tables created thus far. 17
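For reference, here is a minimal sketch of the kind of mapping the figure above illustrates. The schema, table and column names (BANK.USERS, userid, acc_no, and so on) are purely illustrative and are not part of the lab's data set; the point is simply that a composite key and dense columns can be combined in one table (a non-dense column would map a single SQL column, as cq_QUANTITY does in the table above). You would not run this statement as part of the lab.
BigSQL Shell
CREATE HBASE TABLE BANK.USERS
(
userid varchar(10), acc_no varchar(10),
first_name varchar(20), last_name varchar(20),
balance decimal(10,2), min_bal decimal(10,2), interest double
)
COLUMN MAPPING
(
key mapped by (userid, acc_no),
cf_data:cq_names mapped by (first_name, last_name),
cf_data:cq_acct mapped by (balance, min_bal, interest)
);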
  • 18. As in the previous section, issue the following statement which will load data from a file into the InfoSphere BigInsights HBase table. BigSQL Shell LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt' DELIMITED FIELDS TERMINATED BY 't' INTO TABLE gosalesdw.sls_sales_fact_dense; Notice, the number of rows loaded into a table with many-to-one mapping remains the same even though we are storing less data! This statement also executes much faster than the previous load for this exact reason. 44603 rows affected (total: 3.42s) Issue the same statements and commands from both the BigSQL and HBase shell’s as in the previous two sections to verify that the number of rows is the same as in the original dataset. All of the results should be the same as in the previous section. BigSQL Shell SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense; +-------+ | | +-------+ | 44603 | +-------+ 1 row in results(first row: 0.93s; total: 0.93s) BigSQL Shell SELECT organization_key FROM gosalesdw.sls_sales_fact_dense WHERE order_day_key = 20070720; 1405 rows in results(first row: 0.65s; total: 0.68s) HBase Shell scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070721'} 1405 row(s) in 4.3830 seconds As noted earlier, one-to-one mapping leads to use of too much storage space for the same data mapped using composite keys or dense column where the HBase row key or HBase column(s) are made up of multiple relational table columns. This is because HBase would repeat row key, column family name, column name and timestamp for each column value. For relational data which is usually dense, this would cause an explosion in the required storage space. Issue the following command as biadmin from a Linux gnome-terminal to check the directory sizes for the three tables we created thus far. 18
• 19. Linux Terminal
hadoop fs -du /hbase/
…
17731926   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact
3188       hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_dense
47906322   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_unique
…
Notice that the dense table is significantly smaller than the others. The table in which we forced uniqueness is the largest, since it needs to append a UUID to each row key.
Data Collation Problem
All data represented thus far has been stored as strings. That is the default encoding on HBase tables created by BigSQL. Therefore, numeric data is not collated correctly. HBase uses lexicographic ordering, so you may run into cases where a query returns wrong results. The following scenario walks through a situation where data is not collated correctly.
Using the Big SQL insert into hbase statement, add the following row to the sls_sales_fact_dense table we previously defined and loaded data into. Notice that the date we are specifying as part of the ORDER_DAY_KEY column (which has data type int) is a larger numerical value and does not conform to any date standard, since it contains an extra digit.
BigSQL Shell
INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) VALUES (200707201, 11171, 4428, 7109, 5588, 30265, 5501, 605);
 Note: The insert command is available for HBase tables; however, it is not a supported feature.
Issue a scan on the table with the following start and stop criteria.
HBase Shell
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070721'}
Take notice of the last three rows/cells returned from the output of this scan. The newly added row shows up in the scan even though its integer value is not between 20070720 and 20070721. 19
  • 20. 200707201x0011171x004428x007109x005588x003 column=cf_data:cq_DOLLAR_VALUES, timestamp=1376692067977, value= 0264x005501x00605 200707201x0011171x004428x007109x005588x003 column=cf_data:cq_OTHER_KEYS, timestamp=1376692067977, value= 0264x005501x00605 200707201x0011171x004428x007109x005588x003 column=cf_data:cq_QUANTITY, timestamp=1376692067977, value= 0264x005501x00605 1406 row(s) in 4.2400 seconds Now insert another row into the table with the following command. This time we are conforming to the date format of YYYYMMDD and incrementing the day by 1 from the last value returned in the table; i.e., 20070721. BigSQL Shell INSERT INTO gosalesdw.sls_sales_fact_dense (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) VALUES (20070721, 11171, 4428, 7109, 5588, 30265, 5501, 605); Issue another scan on the table. Keep in mind to increase the stoprow criteria by 1 day. HBase Shell scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070722'} Now notice that the newly added row is included as part of the result set, and the row with ORDER_DAY_KEY of 200707201 is after the row with ORDER_DAY_KEY of 20070721. This is an example of numeric data is not collated properly. Meaning, the rows are not being stored in numerical order as one might expect but rather in byte lexicographical order. 200707201x0011171x004428x007109x005588x003 timestamp=1376692067977, value= 0264x005501x00605 200707201x0011171x004428x007109x005588x003 timestamp=1376692067977, value= 0264x005501x00605 200707201x0011171x004428x007109x005588x003 timestamp=1376692067977, value= 0264x005501x00605 20070721x0011171x004428x007109x005588x0030 timestamp=1376692480966, value= 265x005501x00605 20070721x0011171x004428x007109x005588x0030 timestamp=1376692480966, value= 265x005501x00605 20070721x0011171x004428x007109x005588x0030 timestamp=1376692480966, value= 265x005501x00605 1407 row(s) in 2.8840 seconds column=cf_data:cq_DOLLAR_VALUES, column=cf_data:cq_OTHER_KEYS, column=cf_data:cq_QUANTITY, column=cf_data:cq_DOLLAR_VALUES, column=cf_data:cq_OTHER_KEYS, column=cf_data:cq_QUANTITY, Many-to-one Mapping with Binary Encoding 20
  • 21. Big SQL supports two types of data encodings: string and binary. Each HBase entity can also have its own encoding. For example, a row key can be encoded as a string, one HBase column can be encoded as binary and another as string. String is the default encoding used in Big SQL HBase tables. The value is converted to string and stored as UTF-8 bytes. When multiple parts are packed into one HBase entity, separators are used to delimit data. The default separator is the null byte. As it is the lowest byte, it maintains data collation and allows range queries and partial row scans to work correctly. Binary encoding in Big SQL is sort-able, so numeric data including negative number collate properly. It handles separators internally and avoids issues of separators existing within data by escaping it. Issue the following DDL statement from the BigSQL shell to create a dense table as we did in the previous section, but this time overriding the default encoding to binary. BigSQL Shell CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE_BINARY ( ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int, UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2) ) COLUMN MAPPING ( key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY), cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY), cf_data:cq_QUANTITY mapped by (QUANTITY), cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) ) default encoding binary; Once again, use the load hbase data command to load the data into the table. This time we are adding the DISABLE WAL clause. By using the option to disable WAL (write-ahead log), writes into HBase can be sped up. However, this is not a safe option. Turning off WAL can result in data loss if a region server crashes. Another possible option to speed up load is to increase the write buffer size. BigSQL Shell LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt' DELIMITED FIELDS TERMINATED BY 't' INTO TABLE gosalesdw.sls_sales_fact_dense_binary DISABLE WAL; 21
• 22. 44603 rows affected (total: 5.54s)
Issue a select statement on the newly created and loaded table with binary encoding, sls_sales_fact_dense_binary.
BigSQL Shell
SELECT * FROM gosalesdw.sls_sales_fact_dense_binary go -m discard;
 Note: The "go -m discard" option is used so that the results of the command will not be displayed in the terminal.
44603 rows in results(first row: 0.35s; total: 2.89s)
Issue another select statement on the previous table that has string encoding, sls_sales_fact_dense.
BigSQL Shell
SELECT * FROM gosalesdw.sls_sales_fact_dense go -m discard;
44605 rows in results(first row: 0.31s; total: 3.1s)
(The two extra rows in this result come from the inserts performed in the data collation section above.) The main point to see here is that the query against the binary-encoded table can return faster. Numeric types are also collated properly.
 Note: You will probably not see much, if any, performance difference in this lab exercise since we are working with such a small dataset.
There is no custom serialization/deserialization logic required for string encoding. This makes it portable in case one would want to use another application to read data in HBase tables. A main use case for string encoding is when someone wants to map existing data. Delimited data is a very common form of storing data and it can be easily mapped using Big SQL string encoding. However, parsing strings is expensive and queries with data encoded as strings are slow. Also, numeric data is not collated correctly, as seen. Queries on data encoded as binary have faster response times. Numeric data, including negative numbers, is also collated correctly with binary encoding. The downside is that the data is encoded by Big SQL's own logic and may not be portable as-is.
Many-to-one Mapping with HBase Pre-created Regions and External Tables
HBase automatically handles splitting regions when they reach a set limit. In some scenarios, like bulk loading, it is more efficient to pre-create regions so that the load operation can take place in parallel. The sales data covers 4 months, April through July of 2007. We can pre-create regions by specifying splits in the create table command. 22
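The essential part of pre-creating regions is the SPLITS specification. Stripped of the column family attributes, a minimal sketch of such an HBase shell create statement looks like the one below; the table name here is only a placeholder, and the full command actually used in the lab, with all of its column family attributes spelled out, is shown in the next step.
HBase Shell
create 'some_table', 'cf_data', {SPLITS => ['200704', '200705', '200706', '200707']}
Each split point becomes a region boundary, so rows whose keys start with the April, May, June and July 2007 prefixes land in different regions and can be loaded and scanned in parallel.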
  • 23. In this section, we will create a table within the HBase shell with pre-defined splits, not using any Big SQL features at first. Than we will showcase how users can map existing data in HBase to Big SQL which can prove to be a very common practice. This is made possible by creating what is called external tables. Start by issuing the following statement in the HBase shell. This will create the sls_sales_fact_dense_split table with pre-defined region splits for April through July in 2007. HBase Shell create 'gosalesdw.sls_sales_fact_dense_split', {NAME => 'cf_data', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false', COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true', MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', BLOOMFILTER => 'NONE', TTL => '2147483647', VERSIONS => '2147483647', BLOCKSIZE => '65536'}, {SPLITS => ['200704', '200705', '200706', '200707']} Issue the following list command on the HBase shell to verify the newly created table. HBase Shell list Note that if we were to list the tables from the Big SQL shell, we would not see this table because we have not made any association yet to Big SQL. Open and point a browser to the following URL: http://bivm:60010/. Scroll down and click on the table we had just defined in the HBase shell, gosalesdw.sls_sales_fact_dense_split. 23
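If you prefer the command line over the web console, region boundaries are also recorded in HBase's catalog table. The sketch below assumes the HBase 0.94-era shell shipped with this BigInsights release, where the catalog table is named .META. (newer HBase releases rename it, so adjust accordingly); it lists region information, including start and end keys, for every table on the cluster.
HBase Shell
scan '.META.', {COLUMNS => 'info:regioninfo'}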
  • 24. Examine the pre-created regions for this table as we had defined when creating the table. Execute the following create external hbase command to map the existing table we have just created in HBase to Big SQL. Some thing to note about the command: The create table statement allows specifying a different name for SQL table through hbase table name clause. Using external tables, you can also create multiple views of same HBase table. For example, one table can map to few columns and another table to another set of columns etc. Notice the column mapping section of the create table statement allows specifying a different separator for each column and row key. Another place where external tables can be used is to map tables created using Hive HBase storage handler. These cannot be directly read using Big SQL storage handler. BigSQL Shell 24
  • 25. CREATE EXTERNAL HBASE TABLE GOSALESDW.EXTERNAL_SLS_SALES_FACT_DENSE_SPLIT ( ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int, RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int, PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int, SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int, UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2) ) COLUMN MAPPING ( key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) SEPARATOR '-', cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY) SEPARATOR '/', cf_data:cq_QUANTITY mapped by (QUANTITY), cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) SEPARATOR '|' ) HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split'; The data in external tables is not validated at creation time. For example, if a column in the external table contains data with separators incorrectly defined, the query results would be unpredictable.  Note: External tables are not owned by Big SQL and hence cannot be dropped via Big SQL. Also, secondary indexes cannot be created via Big SQL on external tables. Use the following command to load the external table we have defined. BigSQL Shell LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt' DELIMITED FIELDS TERMINATED BY 't' INTO TABLE gosalesdw.external_sls_sales_fact_dense_split; 44603 rows affected (total: 1m57.2s) Verify that the same number of rows loaded is also the same number of rows returned by querying the external SQL table. BigSQL Shell SELECT COUNT(*) FROM gosalesdw.external_sls_sales_fact_dense_split; 25
• 26. +--------+
|        |
+--------+
| 44603  |
+--------+
1 row in results(first row: 6.44s; total: 6.46s)
Verify the same from the HBase shell directly on the underlying HBase table.
HBase Shell
count 'gosalesdw.sls_sales_fact_dense_split'
...
44603 row(s) in 9.1620 seconds
Issue a get command from the HBase shell specifying the row key as follows. Notice the separator between each part of the row key is a “-”, which is what we defined when originally creating the external table.
HBase Shell
get 'gosalesdw.sls_sales_fact_dense_split', '20070720-11171-4428-7109-5588-30263-5501-605'
In the following output you can also see the other separators we defined for the external table: “|” for cq_DOLLAR_VALUES and “/” for cq_OTHER_KEYS.
COLUMN                      CELL
 cf_data:cq_DOLLAR_VALUES   timestamp=1376690502630, value=33.59|62.65|62.65|0.4638|1566.25|726.50
 cf_data:cq_OTHER_KEYS      timestamp=1376690502630, value=481896/20070723/20070723
 cf_data:cq_QUANTITY        timestamp=1376690502630, value=25
3 row(s) in 0.0610 seconds
Of course, in Big SQL we don't need to specify the separators such as “-” when querying against the table, as with the command below.
BigSQL Shell
SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split WHERE ORDER_DAY_KEY = 20070720 AND ORGANIZATION_KEY = 11171 AND EMPLOYEE_KEY = 4428 AND RETAILER_KEY = 7109 AND RETAILER_SITE_KEY = 5588 AND PRODUCT_KEY = 30263 AND PROMOTION_KEY = 5501 AND ORDER_METHOD_KEY = 605;
Load Data: Error Handling
In this final section of this part of the lab, we will examine how to handle errors during the load operation. The load hbase command has an option to continue past errors. The LOG ERROR ROWS IN FILE clause can be used to specify a file name to log any rows that could not be loaded 26
  • 27. because of errors. Some of the common errors are invalid numeric types, and a separator existing within the data for string encoding. Linux Terminal hadoop fs -cat /user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt 2007072a … … … b0070720 … … … 2007-07-20 … … … 20070720 … … … 20070721 … … … 11171 … … … … … … … … … … … … … 11171 … … … … … … … … … … … 11171 … … … … … … … … … … … 11-71 … … … … … … … … … … … 11171 … … … … … … … … … … … … … … … … … … … Note that separator appearing within the data is an issue with string encoding. Knowing there are errors with the input data, proceed to issue the following load command, specifying a directory and file where to put the “bad” rows. BigSQL Shell LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt' DELIMITED FIELDS TERMINATED BY 't' INTO TABLE gosalesdw.external_sls_sales_fact_dense_split LOG ERROR ROWS IN FILE '/tmp/SLS_SALES_FACT_load.err'; In this example, 4 rows did not get loaded because of errors. Note that load reports all the rows that passed through it 1 row affected (total: 2.74s) Examine the specified file in the load command to view the rows which we not loaded. Linux Terminal hadoop fs -cat /tmp/SLS_SALES_FACT_load.err "2007072a","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…" "b0070720","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…" "2007-07-20","11171","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…" "20070720","11-71","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…","…" [OPTIONAL] HBase Access via JAQL Jaql has an HBase module that can be used to create and insert data into HBase tables and query them efficiently using multiple modes - local mode that directly access HBase as well as map reduce mode. It allows specifying query optimization options similar to what is available in hbase shell. The capability to transparently use map reduce jobs makes it work well with bigger tables. At the same time, users can force local mode when they run point or 27
  • 28. range queries. It allows use of a SQL language subset termed as Jaql SQL which provides the capability to join, perform grouping and other aggregations on tables. It also provides access to data from different sources such as relational DBMS and different formats like delimited files, Avro and anything that is supported by Jaql. The results of the query can be written in different formats to HDFS and read by other BigInsights applications like BigSheets for further analysis. In this section, we’ll first pull information from our relational DMBS and than go over use of Jaql HBase module, specifically the additional features that it provides. Start by opening a Jaql shell. You can open the same (JSQSH) terminal that was used for Big SQL by adding the “--jaql" option as shown below. This is a much better environment to work with than the standard Jaql Shell as it provides features like previous command using the up arrow key and you can also traverse through your commands using the left/right arrow keys. Linux Terminal /opt/ibm/biginsights/bigsql/jsqsh/bin/jsqsh --jaql; Once in the JSQSH shell with Jaql option, load the dbms::jdbc driver with the following command. BigSQL/JAQL Shell import dbms::jdbc; Add the JDBC driver JAR file to the classpath. BigSQL/JAQL Shell addRelativeClassPath(getSystemSearchPath(), '/opt/ibm/db2/V10.5/java/db2jcc.jar'); Supply the connection information. BigSQL/JAQL Shell db := jdbc::connect( driver = 'com.ibm.db2.jcc.DB2Driver', url = 'jdbc:db2://localhost:50000/gosales', properties = {user: "db2inst1", password: "password"} ); Specify the rows to be retrieved with a SQL select statement. BigSQL/JAQL Shell DESC := jdbc::prepare( db, query = "SELECT * FROM db2inst1.sls_sales_fact_10p"); In many-to-one mapping for row key, we went over creation of a composite key. In the next few steps, we will use Jaql to load the same data using a composite key and dense columns. We’ll pack all columns that make up primary key of the relational table into a HBase row key, and we’ll also pack other columns into dense HBase columns. Define a variable to read the original data from the relational JDBC source. This converts each tuple of the table into a JSON record. 28
  • 29. BigSQL/JAQL Shell ssf = localRead(DESC); Transform the record into the required format. Essentially we are doing the same procedure as when we defined the many-to-one mapping in the previous sections. For the first element, which we will use for HBase row key, concatenate the values of the columns that form the primary key of the sales fact table using a “-” separator. For the remaining columns, pack them into other dense HBase columns: cq_OTHER_KEYS (using “/” separator), cq_QUANTITY, and cq_DOLLAR_VALUES (using “|” separator). BigSQL/JAQL Shell ssft = ssf -> transform [$."ORDER_DAY_KEY", $."ORGANIZATION_KEY", $."EMPLOYEE_KEY", $."RETAILER_KEY", $."RETAILER_SITE_KEY", $."PRODUCT_KEY", $."PROMOTION_KEY", $."ORDER_METHOD_KEY", $."SALES_ORDER_KEY", $."SHIP_DAY_KEY", $."CLOSE_DAY_KEY", $."QUANTITY", $."UNIT_COST", $."UNIT_PRICE", $."UNIT_SALE_PRICE", $."GROSS_MARGIN", $."SALE_TOTAL", $."GROSS_PROFIT"] -> transform { key: strcat($[0],"-",$[1],"-",$[2],"-",$[3],"-",$[4],"-",$[5],"",$[6],"-",$[7]), cf_data: { cq_OTHER_KEYS: strcat($[8],"/",$[9],"/",$[10]), cq_QUANTITY: strcat($[11]), cq_DOLLAR_VALUES: strcat($[12],"|",$[13],"|",$[14],"|",$[15],"|",$[16],"|",$[17]) } }; Verify the data is in the correct format by querying the first record. BigSQL/JAQL Shell ssft -> top 1; { "key": "20070418-11114-4415-7314-5794-30124-5501-605", "cf_data": { "cq_OTHER_KEYS": "254121/20070423/20070423", "cq_QUANTITY": "60", "cq_DOLLAR_VALUES": "610.00m|1359.72m|1291.73m|0.5278|77503.80m|40903.80m" } } (1 row in 2.40s) Now we have the data ready to be written into HBase. First import the hbase module which prepares jaql by loading required jars and preparing the environment using the HBase configuration files. BigSQL/JAQL Shell import hbase(*); Use hbaseString to define a schema for the HBase table. The HBase table does not get created until something is written into it. An array of records that match the specified schema 29
  • 30. should be used to write into the HBase table. The data types correspond to how Jaql will interpret the data. BigSQL/JAQL Shell SSFHT = hbaseString('sales_fact2', schema { key: string, cf_data?: {*: string}}, create=true, replace=true, rowBatchSize=10000, colBatchSize=200 ); Note: As this (could be) a big table, specify rowBatchSize and colBatchSize which will be used for scanner catching and column batch size by the internal HBase scan object. The column batch size is useful when there are a huge number of columns in rows. Write to the table using the previously created ssft array which matches the specified schema. BigSQL/JAQL Shell ssft -> write(SSFHT); A write operation will create the HBase table, and populate it with the input data. To confirm, use hbase shell to count (or scan) the table and verify the data was written with the right number of rows. HBase Shell count 'sales_fact2' 44603 row(s) in 3.6230 seconds To read the contents of the HBase table using Jaql, use read on the hbaseString. In the following command we are also passing the read directly into a count function to verify the right number of rows. BigSQL/JAQL Shell count(read(SSFHT)); 44603 To query for rows matching a particular order day key 20070720, use setKeyRange for the partial range query. Use localRead for point and range queries as Jaql is tuned for local execution and performs efficiently. BigSQL/JAQL Shell localRead(SSFHT -> setKeyRange('20070720', '20070721')); Perform the same query using HBase shell. Both complete in similar amount of time. HBase Shell scan 'sales_fact2', {STARTROW => '20070720', STOPROW => '20070721', CACHE => 10000} 30
  • 31. To query for a row when we have the values for all primary key columns, we can construct the entire row key and perform a point query. BigSQL/JAQL Shell localRead(SSFHT -> setKey('20070720-11171-4428-7109-5588-30263-5501605')); Identically, this is what the statement would look like from the HBase shell. HBase Shell get 'sales_fact2', '20070720-11171-4428-7109-5588-30263-5501-605' To use a filter from Jaql, use setFilter function along with addFilter. In the below case, the predicate is on quantity column which is the leading part of the dense column and hence can be used in the predicate. BigSQL/JAQL Shell read(SSFHT -> setFilter([addFilter(filterType.SingleColumnValueFilter, HBaseKeyArrayToBinary(["481896/"]), compareOp.equal, comparators.BinaryPrefixComparator, "cf_data", "cq_OTHER_KEYS", true ) ]) ); PART II – A – Query Handling Efficiently querying HBase requires pushing as much to the server(s) as possible. This includes projection pushdown or fetching the minimal set of columns that are required by the query. It also includes pushing down query predicates into the server as scan limits, filters, index lookups, etc. Setting scan limits is extremely powerful as it can help to narrow down regions we need to scan. With a full row key, HBase can quickly pinpoint the region and the row. With partial keys and key ranges (upper, lower limits or both), HBase can narrow down regions or eliminate regions which fall outside the range. Indexes help to leverage this key lookup but they use two tables to achieve this. Filters cannot eliminate regions but some have capability to skip within a region. They help to narrow down the data set returned to the client. With limited metadata/statistics about HBase tables, supporting a variety of hints helps improve query efficiency. The Data This section describes the schema which the sample data will use to demonstrate the effects of pushdown from Big SQL. 31
• 32. We will use a TPC-H table, the orders table, with 150,000 rows, defined using the mapping shown below. Issue the following command from a Big SQL shell to create the orders table. Notice this table has a many-to-one mapping, meaning there is a composite key and dense columns.
BigSQL Shell
CREATE HBASE TABLE ORDERS
( O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1), O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15), O_CLERK VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79) )
column mapping
(
key mapped by (O_CUSTKEY,O_ORDERKEY),
cf:d mapped by (O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT),
cf:od mapped by (O_ORDERDATE)
)
default encoding binary;
Load the sample data into the newly created table by issuing the following command.
Note: As in Part I, there are three sample sets provided for you. Each one is essentially the same except for one key difference: the amount of data contained within them. The remaining instructions in this lab exercise will use the orders.10p.tbl dataset simply because it has a smaller amount of data and will be faster to work with for demonstration purposes. If you would like to use the larger tables with more data, feel free to do so, but remember to change the names appropriately.
BigSQL Shell
LOAD HBASE DATA INPATH 'tpch/orders.10p.tbl' DELIMITED FIELDS TERMINATED BY '|' INTO TABLE ORDERS;
150000 rows affected (total: 21.52s)
In the next set of sections, we examine the output from the Big SQL log files to point out what you can check to confirm pushdown from Big SQL. To view log messages, you may have to first change logging levels using the commands below.
BigSQL Shell
log com.ibm.jaql.modules.hcat.mapred.JaqlHBaseInputFormat info;
BigSQL Shell
log com.ibm.jaql.modules.hcat.hbase info;
Note that columns are pushed down at the HBase level. So in many-to-one mappings, if the query requires only one part of a dense column with many parts, the entire value for the dense 32
• 33. column will be returned. Therefore it is efficient to pack together columns that are usually queried together.
Use the following command to tail the Big SQL log file. Keep this open in a terminal throughout this entire part of the lab. We will be referring to it quite often to see what is going on behind the scenes when running certain commands.
Linux Terminal
tail -f /var/ibm/biginsights/bigsql/logs/bigsql.log
Projection Pushdown
The first query here does a SELECT * and requests all HBase columns used in the table mapping. The original HBase table could have a lot more columns; we may have defined an external table mapping to just a few columns. In such cases, only the HBase columns used in the mapping will be retrieved.
BigSQL Shell
SELECT * FROM orders go -m discard;
150000 rows in results(first row: 1.73s; total: 10.69s)
In the Big SQL log file, we can see that data was returned from both HBase columns.
BigSQL Log
… …HBase scan details:{…, families={cf=[d, od]}, …, stopRow=, startRow=, totalColumns=2, …}
The second query requests only one HBase column:
BigSQL Shell
SELECT o_totalprice FROM orders go -m discard;
Notice that the query returns much faster, since we are returning much less data.
150000 rows in results(first row: 0.27s; total: 2.83s)
Verify from the log file that this query only executed against one column.
BigSQL Log
… …HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, totalColumns=1, …}
The third query also requests only one HBase column. 33
• 34. BigSQL Shell
SELECT o_orderdate FROM orders go -m discard;
Although this query returns less data, it has a higher response time because serialization/deserialization of the timestamp type is expensive.
150000 rows in results(first row: 0.37s; total: 4.5s)
BigSQL Log
… …HBase scan details:{…, families={cf=[od]}, …, stopRow=, startRow=, totalColumns=1, …}
Predicate Pushdown
Point Scan
Identifying and using point scans is the most effective optimization for queries into HBase. To convert a query to a point scan, we need predicate values covering the full row key. These can come in as multiple predicates, since Big SQL supports composite keys. The query analyzer in Big SQL is capable of combining multiple predicates to identify a full row scan. Currently, this analysis happens at run time in the storage handler. At that point, the decision of whether or not to use map reduce has already been made. For now, to bypass map reduce, a user has to provide explicit local-mode access hints. In the example below, the command “set force local on” makes sure all queries executing in the session do not use map reduce.
BigSQL Shell
set force local on;
Issue the following select statement, which provides predicates for the columns that make up the full row key: custkey and orderkey.
BigSQL Shell
select o_orderkey, o_totalprice from orders where o_custkey=4 and o_orderkey=5612065;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 5612065    | 71845.25781  |
+------------+--------------+
1 row in results(first row: 0.18s; total: 0.18s)
If we check the logs, we can see that Big SQL successfully took both predicates specified and combined them to do a row scan using all parts of the composite key. 34
  • 35. BigSQL Log … … Found a row scan by combining all composite key parts. … Found a row scan from row key parts … HBase filter list created using AND. … HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1): [PrefixFilter x01x80x00x00x04], …, stopRow=x01x80x00x00x04x01x80x00x00x00x00UxA2!, startRow=x01x80x00x00x04x01x80x00x00x00x00UxA2!, totalColumns=1, …} Partial Row Scan This section shows the capability of Big SQL server to process predicates on leading parts of row key – and not necessarily the full row key as in the previous section. Issue the following example query that provides a predicate for the first part of the row key, custkey. BigSQL Shell select o_orderkey, o_totalprice from orders where o_custkey=4; +------------+--------------+ | o_orderkey | o_totalprice | +------------+--------------+ | 5453440 | 17938.41016 | | 5612065 | 71845.25781 | +------------+--------------+ 2 rows in results(first row: 0.19s; total: 0.19s) Checking the logs, you can see the predicate on first part of row key is converted to a range scan. The stop row in the scan is non-inclusive. So it is internally appended with lowest possible byte to cover the partial range. BigSQL Log … … Found a row scan that uses the first 1 part(s) of composite key. … Found a row scan from row key parts … HBase filter list created using AND. … HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1): [PrefixFilter x01x80x00x00x04], …, stopRow=x01x80x00x00x04xFF, startRow=x01x80x00x00x01, totalColumns=1, …} Range Scan When there are range predicates, we can set the start or stop row or both. In our example query below we have a ‘less than’ predicate; therefore we only know the stop row. However, even setting this will help eliminate regions with row keys that fall above the stop row. Issue the following command. BigSQL Shell select o_orderkey, o_totalprice from orders where o_custkey < 15; 35
• 36. +------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 5453440    | 17938.41016  |
| 5612065    | 71845.25781  |
| 5805349    | 255145.51562 |
| 5987111    | 97765.57812  |
| 5692738    | 143292.53125 |
| 5885190    | 125285.42969 |
| 5693440    | 117319.15625 |
| 5880160    | 198773.68750 |
| 5414466    | 149205.60938 |
| 5534435    | 136184.51562 |
| 5566567    | 56285.71094  |
+------------+--------------+
11 rows in results(first row: 0.22s; total: 0.22s)
Notice in the log file that, similar to the previous section, we are only using the first part of the composite key, since we are specifying custkey as the predicate. However, in this case, since we only know the stop row (less than 15), there is no value for the start row portion of the scan.
BigSQL Log
… … Found a row scan that uses the first 1 part(s) of composite key. … Found a row scan from row key parts … … HBase scan details:{…, families={cf=[d]}, …, stopRow=x01x80x00x00x0F, startRow=, totalColumns=1, …}
Full Table Scan
This section simply shows an example of what happens when none of the predicates can be pushed down to HBase. In this example query, the predicate (orderkey) is on a non-leading part of the row key and therefore is not pushed down. Issue the command and see that this results in a full table scan.
BigSQL Shell
select o_orderkey, o_totalprice from orders where o_orderkey=5612065;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
| 5612065    | 71845.25781  |
+------------+--------------+
1 row in results(first row: 1.90s; total: 1.90s)
As can be determined by examining the logs, in cases where none of the predicates can be pushed to HBase, a full table scan is required, meaning there are no specified values for either the start or stop row.
BigSQL Log 36
  • 37. … … HBase scan details:{…, families={cf=[d]}, …, stopRow=, startRow=, …} Automatic Index Usage This section will demonstrate the benefits of an index lookup. Before creating an index, let’s first execute a query that will invoke a full table scan so we can do a comparison later to see the performance benefits we can achieve by creating an index on particular column(s). Notice we are specifying a predicate on the clerk column which is the middle part of a dense column defined. BigSQL Shell SELECT * FROM orders WHERE o_clerk='Clerk#000000999' go -m discard; 154 rows in results(first row: 2.40s; total: 4.32s) As you can see below in the log file, there is no usage of an index. BigSQL Log … … indexScanInfo: [isIndexScan: false], valuesInfo: [minValue: undefined, minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo: [numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false], indexScanCandidateInfo: [hasIndexScanCandidate: false]] … Issue the following command to create the index on the clerk column which is the middle part of a dense column in table. This creates a new table to store index data. The index table stores the column value and row key it appears in. BigSQL Shell CREATE INDEX ix_clerk ON TABLE orders (o_clerk) AS 'hbase';  Note: The create index statement will create the new index table which uses <base_table_name>_<index_name> as its name, it deploys the coprocessor, populates the index table using map reduce index builder. The “as hbase” clause indicates the type of index handler to use. For HBase, we have a separate index handler. 0 rows affected (total: 1m17.47s) Re-issue the exact same command as we did earlier. BigSQL Shell SELECT * FROM orders WHERE o_clerk='Clerk#000000999' go -m discard; 37
  • 38. After creating the index and issuing the same select statement, Big SQL will automatically take advantage of the index that was created and avoids a full table scan which results in a much faster response time. 154 rows in results(first row: 0.73s; total: 0.74s) You can verify in the log file that Big SQL. In this case the index table is scanned for all matching rows that start with value of predicate clerk, in this case Clerk#000000999. From the matching row(s), the row key(s) of base table are extracted and get requests are batched and sent to the data table. BigSQL Log … … indexScanInfo: [isIndexScan: true, keyLookupType: point_query, indexDetails: JaqlHBaseIndex[indexName: ix_clerk, indexSpec: {"bin_terminator": "#","columns": [{"cf": "cf","col": "o_clerk","cq": "d","from_dense": "true"}],"comp_seperator": "%","composite": "false","key_seperator": "/","name": "ix_clerk"}, numColumns: 1, columns: [Ljava.lang.String;@3ced3ced, startValue: x01Clerk#000000999x00, stopValue: x01Clerk#000000999x00]], valuesInfo: [minValue: [B@4b834b83, minInclusive: false, maxValue: undefined, maxInclusive: false], filterInfo: [numFilters: 0], rowScanCandidateInfo: [hasRowScanCandidate: false], indexScanCandidateInfo: [hasIndexScanCandidate: true, indexScanCandidate: IndexScanCandidate[columnName: o_clerk,indexColValue: [B@4cda4cda,[operator: =,isVariableLength: false,type: null,encoding: BINARY]]] … Found an index scan from index scan candidates. Details: … Index name: ix_clerk … … Index query details: [indexSpec:ix_clerk, startValueBytes: #Clerk#000000999, stopValueBytes: #Clerk#000000999,baseTableScanStart:,baseTableScanStop:] … Index query successful.  Note: For a composite index where multiple columns are used to define an index, predicates are handled and pushed down similar to what is done for composite row keys. If there was no index, the predicate could not be pushed down as it is the non-leading part of a dense column. In such cases, a full table scan is required as seen at the beginning of this section. Pushing Down Filters into HBase Though HBase filters do not avoid full table scan, they limit the rows and data returned to the client. HBase filters have a skip facility which lets them skip over certain portions of data. Many of the inbuilt filters implement this and thus prove more efficient than a raw table scan. There are filters that can limit the data within a row. For example, when we need to only get columns in the key part of filter, some filters like FirstKeyOnlyFilter and KeyOnlyFilter can be applied to get only a single instance of the row key part of data. The sample query below will demonstrate a case where Big SQL pushes down a row scan and a column filter. BigSQL Shell 38
• 39. SELECT o_orderkey FROM orders WHERE o_custkey>100000 AND o_orderstatus='P' go -m discard;

1278 rows in results(first row: 0.37s; total: 0.38s)

Notice that the predicate on the custkey column triggers the row scan. The column filter, SingleColumnValueFilter, is triggered because there is a predicate on the leading part of a dense column (cf:d).

BigSQL Log
… … Found a row scan that uses the first 1 part(s) of composite key. … Found a row scan from row key parts … HBase filter list created using AND. … … HBase scan details:{…, families={cf=[d]}, filter=FilterList AND (1/1): [SingleColumnValueFilter (cf, d, EQUAL, x01Px00)], …, stopRow=, startRow=x01x80x01x86xA1, totalColumns=1, …}

In this way Big SQL can automatically convert predicates into many of these filters and handle queries more efficiently.

Table Access Hints

Access hints affect the strategy used to read the table, identify the source of the data, and optimize a query. For example, the strategy determines such behaviour as whether MapReduce is employed to implement a join or whether an in-memory (hash) join is used. These hints can also control how data is accessed from specific sources. The table access hint that we will explore here is accessmode.

Accessmode

The accessmode hint is very important for HBase because it avoids MapReduce overhead. Combined with point queries, it ensures sub-second response times that are unaffected by the total data size. There are multiple ways to specify the accessmode hint: as a query hint or at the session level. Note that session-level hints take precedence: if "set force local off;" is run in a session, all subsequent queries will use MapReduce even if an explicit accessmode='local' hint is specified on the query.

You can check whether accessmode was explicitly set for the session with the following command in the Big SQL shell.

BigSQL Shell
set;

If you kept the same shell open throughout this part of the lab, you will see the following output. This is because we used "set force local on" earlier in one of the previous sections.
39
• 40.
+--------------------+-------+
|        key         | value |
+--------------------+-------+
| bigsql.force.local | true  |
+--------------------+-------+
1 row in results(first row: 0.0s; total: 0.0s)

To change the setting back to the default, set the value to automatic with the following command.

BigSQL Shell
set force local auto;

Issue the following SELECT query.

BigSQL Shell
select o_orderkey from orders where o_custkey=4 and o_orderkey=5612065;

Notice how long the query takes.

+------------+
| o_orderkey |
+------------+
| 5612065    |
+------------+
1 row in results(first row: 7.2s; total: 7.2s)

Issue the same query with an accessmode hint this time.

BigSQL Shell
select o_orderkey from orders /*+ accessmode='local' +*/ where o_custkey=4 and o_orderkey=5612065;

Notice that the query returns its results much faster. This is because of the local access mode: no MapReduce job is employed.

+------------+
| o_orderkey |
+------------+
| 5612065    |
+------------+
1 row in results(first row: 0.32s; total: 0.32s)

PART II – B – Connecting to Big SQL Server via JDBC

Organizations interested in Big SQL often have considerable SQL skills in-house, as well as a suite of SQL-based business intelligence applications and query/reporting tools. The idea of being able to leverage existing skills and tools, and perhaps reuse portions of existing applications, can be quite appealing to organizations new to Hadoop.
40
• 41. Therefore, Big SQL supports a JDBC driver that conforms to the JDBC 3.0 specification to provide connectivity to Java™ applications. (Big SQL also supports a 32-bit or 64-bit ODBC driver, on either Linux or Windows, that conforms to the Microsoft Open Database Connectivity 3.0.0 specification, to provide connectivity to C and C++ applications.)

In this part of the lab, we will explore how to use Big SQL's JDBC driver with BIRT, an open-source business intelligence and reporting tool that plugs into Eclipse. We will use this tool to run some very simple reports, using SQL queries, on data stored in HBase in our Hadoop environment.

Business Intelligence and Reporting via BIRT

To start, open Eclipse from the Desktop of the virtual machine by clicking on the Eclipse icon. When prompted to do so, leave the default workspace as is.

Once Eclipse has loaded, switch to the 'Report Design' perspective so that we can work with BIRT. To do so, from the menu bar click on: Window -> Open Perspective -> Other.... Then click on: Report Design -> OK as shown below.

Once in the Report Design perspective, double-click on Orders.rptdesign in the Navigator pane (on the bottom left-hand side) to open the pre-created report.
41
• 42. Note: A report has been created on your behalf to more quickly illustrate the functionality and usage of the Big SQL drivers, while removing the tedious steps of designing a report in BIRT.

Expand 'Data Sets' in the Data Explorer. You will notice that the data sets (or report queries) have a red 'X' beside them. This is because the pre-created report queries are not yet associated with a data source. All that remains before the report can be run is to set up the JDBC connection to Big SQL.

To obtain the client drivers, open the BigInsights web console from the Desktop of the VM, or point your browser to: http://bivm:8080. From the Welcome tab, in the Quick Links section, select Download the Big SQL Client drivers. Save the file to /home/biadmin/Desktop/IBD-1687A/.
42
• 43. Open the folder where you saved the file and extract the contents of the client package into the same directory.

Back in Eclipse, add Big SQL as a data source. Right-click on Data Sources -> New Data Source in the Data Explorer pane on the top left-hand side. In the New Data Source window, select JDBC Data Source and specify "Big SQL" as the Data Source Name. Click Next.
43
• 44. In the New JDBC Data Source Profile window, click on Manage Drivers…. Once the Manage JDBC Drivers window appears, click on Add…. Point to the location where the client drivers were extracted, then click OK.

Once added, you should have an entry for the BigSQLDriver in the Driver Class drop-down list. Select it, and complete the fields with the following information:
• Database URL: jdbc:bigsql://localhost:7052
• User Name: biadmin
• Password: biadmin
44
• 45. Click on 'Test Connection...' to ensure we can connect to Big SQL using the JDBC driver.

Double-click 'Orders per year' and add the Big SQL connection that was just defined. Examine the query:

WITH test (order_year, order_date) AS
  (SELECT YEAR(o_orderdate), o_orderdate FROM orders FETCH FIRST 20 ROWS ONLY)
SELECT order_year, COUNT(*) AS cnt FROM test GROUP BY order_year
45
• 46. Carry out the same procedure to add the Big SQL connection for the 'Top 5 salesmen' data set and examine the query.

WITH base (o_clerk, tot) AS
  (SELECT o_clerk, SUM(o_totalprice) AS tot FROM orders GROUP BY o_clerk ORDER BY tot DESC)
SELECT o_clerk, tot FROM base FETCH FIRST 5 ROWS ONLY

Note: Disregard the red 'X' that may still appear on the Data Sets. This is a bug and can safely be ignored.

Now that we have defined the Data Source and configured the Data Sets, run the report in the Web Viewer as shown in the diagram below. The output from the Web Viewer against the orders table in Big SQL should be as follows.
46
• 47. As seen in this part of the lab, a variety of IBM and non-IBM software that supports JDBC and ODBC data sources can also be configured to work with Big SQL. We used BIRT here, but, as another example, Cognos Business Intelligence can use Big SQL's JDBC interface to query data, generate reports, and perform other analytical functions. Similarly, other tools such as Tableau can leverage Big SQL's ODBC drivers to work with data stored in a BigInsights cluster.
47
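If you prefer to access Big SQL programmatically rather than through a reporting tool, the same JDBC driver and connection details used above can be exercised from a small standalone Java program. The following is only a minimal sketch, not part of the lab materials: the fully qualified driver class name (com.ibm.biginsights.bigsql.jdbc.BigSQLDriver) is an assumption inferred from the BigSQLDriver entry in the Driver Class drop-down, so verify it against the JAR extracted from the client package before running.

// Minimal sketch of a standalone JDBC client for Big SQL.
// Assumption: the driver class name below is inferred from the "BigSQLDriver"
// entry seen in BIRT's Driver Class drop-down; confirm it against the client JAR.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlJdbcSample {
    public static void main(String[] args) throws Exception {
        // Explicitly load the driver; needed for a JDBC 3.0-level driver.
        Class.forName("com.ibm.biginsights.bigsql.jdbc.BigSQLDriver");

        // Same URL and credentials as in the BIRT data source definition.
        String url = "jdbc:bigsql://localhost:7052";
        try (Connection conn = DriverManager.getConnection(url, "biadmin", "biadmin");
             Statement stmt = conn.createStatement();
             // The "Orders per year" report query from the lab.
             ResultSet rs = stmt.executeQuery(
                 "WITH test (order_year, order_date) AS "
               + "(SELECT YEAR(o_orderdate), o_orderdate FROM orders FETCH FIRST 20 ROWS ONLY) "
               + "SELECT order_year, COUNT(*) AS cnt FROM test GROUP BY order_year")) {
            // Print one line per report row: year and order count.
            while (rs.next()) {
                System.out.println(rs.getString("order_year") + "\t" + rs.getString("cnt"));
            }
        }
    }
}

Compile and run the class with the Big SQL client JAR(s) on the classpath; the rows printed should match the year and count values rendered in the BIRT report.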
• 48. Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more
  o Find the community that interests you …
    • Information Management bit.ly/InfoMgmtCommunity
    • Business Analytics bit.ly/AnalyticsCommunity
    • Enterprise Content Management bit.ly/ECMCommunity
• IBM Champions
  o Recognizing individuals who have made the most outstanding contributions to Information Management, Business Analytics, and Enterprise Content Management communities
  o ibm.com/champion

Thank You! Your Feedback is Important!
• Access the Conference Agenda Builder to complete your session surveys
  o Any web or mobile browser at http://iod13surveys.com/surveys.html
  o Any Agenda Builder kiosk onsite
48
• 49. Acknowledgements and Disclaimers:

Availability: References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.
• U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.
49