Follow this hands-on lab to discover how Spark programmers can work with data managed by Big SQL, IBM's SQL interface for Hadoop. Examples use Scala and the Spark shell in a BigInsights 4.3 technical preview 2 environment.
Big Data: Working with Big SQL data from Spark
Using Spark with Big SQL
Cynthia M. Saracco
IBM Solution Architect
April 12, 2017
Contents
LAB 1 OVERVIEW
1.1. WHAT YOU'LL LEARN
1.2. PRE-REQUISITES
1.3. ABOUT YOUR ENVIRONMENT
1.4. GETTING STARTED
LAB 2 USING SPARK TO WORK WITH BIG SQL TABLES
2.1. CREATING AND POPULATING BIG SQL SAMPLE TABLES
2.2. LAUNCH JSQSH AND CONNECT TO YOUR DATABASE
2.3. CREATE BIG SQL HIVE TABLES
2.4. CREATE A BIG SQL EXTERNALLY MANAGED TABLE
2.5. CREATE A BIG SQL TABLE IN HBASE
2.6. QUERYING AND MANIPULATING BIG SQL DATA THROUGH SPARK
2.7. EXPLORE THE BASICS
2.8. JOIN DATA FROM MULTIPLE TABLES
2.9. USE SPARK MLLIB TO WORK WITH BIG SQL DATA
2.10. OPTIONAL: JOIN BIG SQL AND JSON DATA
LAB 3 SUMMARY
Lab 1 Overview
This hands-on lab helps you explore how Spark programmers can work with data managed by Big SQL.
Big SQL is a high-performance query engine for IBM BigInsights and the Hortonworks Data Platform.
Apache Spark, part of IBM’s Open Platform for Apache Hadoop, is a fast, general-purpose engine for
processing Big Data, including data managed by Hadoop. Particularly appealing to many Spark
programmers are built-in and third-party libraries for machine learning, streaming, SQL, and more.
Given the popularity of both Big SQL and Spark, organizations may want to deploy and use both
technologies. This lab introduces you to one way in which you can integrate these technologies –
namely, by using Spark SQL and its support for JDBC-enabled data sources to manipulate Big SQL
tables in Hive, HBase, or arbitrary HDFS directories. It’s worth noting that other forms of integration are
possible. For example, Big SQL programmers can launch Spark jobs from within their Big SQL queries
and integrate results returned from Spark with Big SQL data. However, such integration is beyond the
scope of this lab. Consult the Big SQL production documentation for further details, if interested.
1.1. What you'll learn
After completing all exercises in this lab, you'll know how to
• Work with data in Big SQL tables using the Spark shell.
• Create and populate Spark DataFrames with data from Big SQL tables.
• Query and join data from Big SQL tables using Spark SQL.
• Use a simple Spark ML (machine learning) function to operate on Big SQL data.
• Optionally, use Spark SQL to join Big SQL data with complex JSON data in HDFS.
Allow 1 – 2 hours to complete this lab. Special thanks to Dan Kikuchi for reviewing this material.
1.2. Pre-requisites
Prior to beginning this lab, you will need access to a BigInsights 4.3 environment as described in the
subsequent section. In addition, you should be familiar with the basics of Big SQL and Spark. Labs 1 –
4 of Getting Started with Big SQL (http://www.slideshare.net/CynthiaSaracco/big-sql40-hol) and the lab
on Working with HBase and Big SQL (https://www.slideshare.net/CynthiaSaracco/h-base-introlabv4)
will help you become familiar with the fundamentals of Big SQL needed for this lab. The
Apache Spark web site (https://spark.apache.org/) contains various resources to help you become
familiar with Spark.
1.3. About your environment
This lab requires a BigInsights 4.3 environment in which Big SQL, JSqsh (a SQL command-line
interface), HBase, and Spark are installed and running. Big SQL and JSqsh are part of IBM BigInsights.
Spark is part of the IBM Open Platform for Apache Hadoop upon which BigInsights is based.
Examples in this lab were tested on a 4-node test cluster running BigInsights 4.3 technical preview 2 with
Spark 2.1; the specific configuration of this cluster is outlined in the following table. If your environment
is different, modify the sample code and instructions as needed to match your configuration.
                          User     Password
Root account              root     password
Big SQL Administrator     bigsql   bigsql
Ambari Administrator      admin    admin

Property                         Value
Host name                        myhost-master.fyre.ibm.com
Ambari port number               8080
Big SQL database name            bigsql
Big SQL port number              32051
Big SQL installation directory   /usr/ibmpacks/bigsql
Big SQL JDBC driver              /usr/ibmpacks/bigsql/4.3.0.0/db2/java/db2jcc4.jar
JSqsh installation directory     /usr/ibmpacks/common-utils/current/jsqsh
Big SQL samples directory        /usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data
Spark version                    2.1
Spark client                     /usr/iop/current/spark2-client
About the screen captures, sample code, and environment configuration
Screen captures in this lab depict examples and results that may vary from what you
see when you complete the exercises. In addition, you may need to customize some
code examples to match your environment.
1.4. Getting started
To get started with the lab exercises, you need access to a working BigInsights environment, as
described in the previous section. See
https://www.ibm.com/analytics/us/en/technology/hadoop/hadoop-trials.html for download options.
Product documentation for BigInsights, including installation instructions and sample exercises, is
available at
https://www.ibm.com/support/knowledgecenter/SSPT3X_4.3.0/com.ibm.swg.im.infosphere.biginsights.welcome.doc/doc/welcome.html.
Before continuing with this lab, verify that Big SQL, Spark, and all of their pre-requisite services are
running. If you have any questions or need help getting your environment up and running, visit Hadoop
Dev (https://developer.ibm.com/hadoop/) and review the product documentation or post a message to
the forum. You cannot proceed with subsequent lab exercises without access to a working environment.
Lab 2 Using Spark to work with Big SQL tables
With your BigInsights environment running Big SQL and Spark, you’re ready to explore how to access
Big SQL data from the Spark shell. In this exercise, you will
• Create and populate a few Big SQL tables
• Use the Spark shell, Scala, and Spark SQL to query and join data in these tables
• Invoke a Spark ML (machine learning) function over Big SQL data
• Optionally, join data in JSON files with data in Big SQL tables through Spark SQL
This lab presumes you know how to launch and use Big SQL’s command-line interface (JSqsh) to
execute queries and commands. If necessary, consult the Getting Started with Big SQL lab
(http://www.slideshare.net/CynthiaSaracco/big-sql40-hol) or the BigInsights Knowledge Center
(https://www.ibm.com/support/knowledgecenter/SSPT3X_4.3.0/com.ibm.swg.im.infosphere.biginsights.welcome.doc/doc/welcome.html)
for details.
This lab also presumes that you can access the Big SQL sample data on your local file system. The
sample data ships with Big SQL and is often located in a directory such as
/usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data.
2.1. Creating and populating Big SQL sample tables
In this module, you will create and populate several Big SQL tables that you will access from Spark later.
To illustrate the breadth of Big SQL data access available to Spark programmers, your Big SQL tables
will employ different underlying storage managers. Specifically, your sample tables will store data in the
Hive warehouse, in HBase, and in an arbitrary HDFS directory.
This lab uses 3 of the more than 60 tables that make up the sample Big SQL database, which employs a
star schema design (FACT and DIMENSION tables) typical of a relational data warehouse to model
sales data for various retail products. Unless otherwise indicated, examples presume you’re executing
commands using the bigsql ID.
If you’ve worked through other publicly available Big SQL labs, you may have already completed some of
the necessary work included in this module.
2.2. Launch JSqsh and connect to your database
__1. Launch JSqsh, the Big SQL command line interface. For example, in my environment, I entered
/usr/ibmpacks/common-utils/current/jsqsh/bin/jsqsh
__2. Connect to your Big SQL database. If necessary, launch the JSqsh connection wizard to create
a connection.
setup connections
2.3. Create Big SQL Hive tables
__3. Create two tables in the Hive warehouse using your default schema (which will be "bigsql" if you
connected to your database as that user). The first table is part of the PRODUCT dimension
and includes information about product lines in different languages. The second table is the
sales FACT table, which tracks transactions (orders) of various products.
-- look up table with product line info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup
( product_line_code INT NOT NULL
, product_line_en VARCHAR(90) NOT NULL
, product_line_de VARCHAR(90), product_line_fr VARCHAR(90)
, product_line_ja VARCHAR(90), product_line_cs VARCHAR(90)
, product_line_da VARCHAR(90), product_line_el VARCHAR(90)
, product_line_es VARCHAR(90), product_line_fi VARCHAR(90)
, product_line_hu VARCHAR(90), product_line_id VARCHAR(90)
, product_line_it VARCHAR(90), product_line_ko VARCHAR(90)
, product_line_ms VARCHAR(90), product_line_nl VARCHAR(90)
, product_line_no VARCHAR(90), product_line_pl VARCHAR(90)
, product_line_pt VARCHAR(90), product_line_ru VARCHAR(90)
, product_line_sc VARCHAR(90), product_line_sv VARCHAR(90)
, product_line_tc VARCHAR(90), product_line_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- fact table for sales
CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact
( order_day_key INT NOT NULL
, organization_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, retailer_site_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, order_method_key INT NOT NULL
, sales_order_key INT NOT NULL
, ship_day_key INT NOT NULL
, close_day_key INT NOT NULL
, quantity INT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
__4. Load data into each of these tables using sample files provided with Big SQL. Change the FILE
URL specification in each of the following examples to match your environment. Then,
one at a time, issue each LOAD statement and verify that the operation completed successfully.
LOAD returns a warning message providing details on the number of rows loaded, etc.
load hadoop using file url 'sftp://yourID:yourPassword@myhost-master.fyre.ibm.com:22/usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LINE_LOOKUP.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_PRODUCT_LINE_LOOKUP overwrite;
load hadoop using file url 'sftp://yourID:yourPassword@myhost-master.fyre.ibm.com:22/usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data/GOSALESDW.SLS_SALES_FACT.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE SLS_SALES_FACT overwrite;
__5. Query the tables to verify that the expected number of rows was loaded into each table. Execute
each query below and compare the results with the number of rows specified in the comment
line preceding each query.
-- total rows in SLS_PRODUCT_LINE_LOOKUP = 5
select count(*) from bigsql.SLS_PRODUCT_LINE_LOOKUP;
-- total rows in SLS_SALES_FACT = 446023
select count(*) from bigsql.SLS_SALES_FACT;
__6. Open a terminal window.
__7. Enable public access to the sample tables in your Hive warehouse.
From the command line, issue this command to switch to the root user ID temporarily:
su root
When prompted, enter the password for this account. Then switch to the hdfs ID.
su hdfs
While logged in as user hdfs, issue this command to provide public access to all Hive warehouse
tables:
hdfs dfs -chmod -R 777 /apps/hive/warehouse
2.4. Create a Big SQL externally managed table
Now that you have 2 Big SQL tables in the Hive warehouse, it's time to create an externally managed Big
SQL table – i.e., a table created over a user directory that resides outside of the Hive warehouse. This
user directory will contain the table’s data in files. Creating such a table effectively layers a SQL schema
over existing HDFS data (or data that you may later upload into the target HDFS directory).
__8. From your terminal window, check the directory permissions for HDFS.
hdfs dfs -ls /
If the /user directory cannot be written by the public, you will need to change these
permissions so that you can create the necessary subdirectories for this lab.
While logged in as user hdfs, issue this command:
hdfs dfs -chmod 777 /user
Next, confirm the effect of your change:
hdfs dfs -ls /
Exit the hdfs user account:
exit
Finally, exit the root user account and return to the standard account you’ll be using for this lab
(e.g., bigsql):
exit
__9. Create a directory structure in your distributed file system for the source data file for the product
dimension table. (If desired, alter the HDFS information as appropriate for your environment.)
hdfs dfs -mkdir /user/bigsql_spark_lab
hdfs dfs -mkdir /user/bigsql_spark_lab/sls_product_dim
__10. Upload the source data file (the Big SQL sample data file named
GOSALESDW.SLS_PRODUCT_DIM.txt) into the target DFS directory. Change the local and DFS
directory information below to match your environment.
hdfs dfs -copyFromLocal
/usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt
/user/bigsql_spark_lab/sls_product_dim/SLS_PRODUCT_DIM.txt
__11. List the contents of the HDFS directory and verify that your sample data file is present.
hdfs dfs -ls /user/bigsql_spark_lab/sls_product_dim
__12. Ensure public access to this lab's directory structure.
hdfs dfs -chmod -R 777 /user/bigsql_spark_lab
__13. Return to your Big SQL query execution environment (JSqsh) and connect to your Big SQL
database.
__14. Create an external Big SQL table for the sales product dimension (extern.sls_product_dim).
Note that the LOCATION clause references the DFS directory into which you copied the sample
data.
-- product dimension table
CREATE EXTERNAL HADOOP TABLE IF NOT EXISTS extern.sls_product_dim
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
location '/user/bigsql_spark_lab/sls_product_dim';
If you encounter a SQL -5105 error message, the HDFS directory permissions for your target
directory (e.g., /user/bigsql_spark_lab) may be too restrictive.
From an OS terminal window, issue this command:
hdfs dfs -ls /user/bigsql_spark_lab
Your permissions must include rw settings. Consult the earlier steps in this lab for instructions
on how to reset HDFS directory permissions.
__15. Verify that you can query the table.
-- total rows in EXTERN.SLS_PRODUCT_DIM = 274
select count(*) from EXTERN.SLS_PRODUCT_DIM;
2.5. Create a Big SQL table in HBase
Finally, create and populate a Big SQL table managed by HBase. Specifically, create a table that joins
rows from one of your Big SQL Hive tables (bigsql.sls_product_line_lookup) with rows in your
externally managed Big SQL table (extern.sls_product_dim). Your HBase table will effectively
“flatten” (or de-normalize) content in these tables into a structure that’s more efficient for processing in
HBase.
__16. Create a Big SQL table named sls_product_flat managed by HBase and populate this table
with the results of a query spanning two Big SQL tables that you created previously.
CREATE hbase TABLE IF NOT EXISTS sls_product_flat
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_line_en VARCHAR(90)
, product_line_de VARCHAR(90)
)
column mapping
(
key mapped by (product_key),
data:c2 mapped by (product_line_code),
data:c3 mapped by (product_type_key),
data:c4 mapped by (product_type_code),
data:c5 mapped by (product_line_en),
data:c6 mapped by (product_line_de)
)
as select product_key, d.product_line_code, product_type_key,
product_type_code, product_line_en, product_line_de
from extern.sls_product_dim d, bigsql.sls_product_line_lookup l
where d.product_line_code = l.product_line_code;
This statement creates a Big SQL table named SLS_PRODUCT_FLAT in the current user schema. The
COLUMN MAPPING clause specifies how SQL columns are to be mapped to HBase columns in column
families. For example, the SQL PRODUCT_KEY column maps to the HBase row key, while other SQL
columns are mapped to various columns within the HBase data column family. For more details about
Big SQL HBase support, including column mappings, see the separate lab on Working with HBase and
Big SQL (https://www.slideshare.net/CynthiaSaracco/h-base-introlabv4) or consult the product
documentation.
__17. Verify that 274 rows are present in your Big SQL HBase table.
select count(*) from bigsql.sls_product_flat;
+-----+
| 1 |
+-----+
| 274 |
+-----+
__18. Optionally, query the table to become familiar with its contents.
select product_key, product_line_code, product_line_en from bigsql.sls_product_flat
where product_key > 30270;
+-------------+-------------------+----------------------+
| PRODUCT_KEY | PRODUCT_LINE_CODE | PRODUCT_LINE_EN |
+-------------+-------------------+----------------------+
| 30271 | 993 | Personal Accessories |
| 30272 | 993 | Personal Accessories |
| 30273 | 993 | Personal Accessories |
| 30274 | 993 | Personal Accessories |
+-------------+-------------------+----------------------+
4 rows in results(first row: 0.106s; total: 0.108s)
2.6. Querying and manipulating Big SQL data through Spark
Now that you’ve created and populated the sample Big SQL tables required for this lab, it’s time to
experiment with accessing them. In this module, you’ll launch the Spark shell and issue Scala commands
and expressions to retrieve Big SQL data. Specifically, you’ll use Spark SQL to query data in Big SQL
tables. You’ll model the result sets from your queries as DataFrames, the Spark equivalent of a table. For
details on DataFrames, visit the Spark web site.
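Before touching JDBC, the DataFrame-as-table idea can be previewed in the Spark shell with a few hypothetical in-memory rows (the column names below simply mirror the Big SQL lookup table; `spark` and its implicits are already available in the shell):

```scala
// Hypothetical sample rows for illustration only.
import spark.implicits._

val demoDF = Seq(
  (991, "Camping Equipment"),
  (995, "Golf Equipment")
).toDF("product_line_code", "product_line_en")

demoDF.printSchema()   // named, typed columns, like a table definition
demoDF.show()          // rows render in tabular form
```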
2.7. Explore the basics
In this module, you’ll learn how to use Spark’s support for JDBC data sources to access data in a single
Big SQL table.
__1. From a terminal window, launch the Spark shell using the --driver-class-path option to
specify the location of the Big SQL JDBC driver class (db2jcc4.jar) in your environment.
Adjust the directory information below for the Spark shell and Big SQL .jar file to match
your environment.
/usr/iop/current/spark2-client/bin/spark-shell --driver-class-path
/usr/ibmpacks/bigsql/4.3.0.0/db2/java/db2jcc4.jar
In this section and others that follow, sample output generated from the commands you enter
appears immediately after each command.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Spark context Web UI available at http://xxx.yy.zzz.164:4040
Spark context available as 'sc' (master = local[*], app id = local-1490640789034).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
__2. From the Spark shell, import classes that you’ll need for subsequent lab work.
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
__3. Load data from your sls_product_line_lookup table (in Hive) into a DataFrame named
lookupDF. Adjust the JDBC specification (option values) below as needed to match your
environment.
val lookupDF = spark.read.format("jdbc")
  .option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL")
  .option("dbtable", "bigsql.sls_product_line_lookup")
  .option("user", "bigsql")
  .option("password", "bigsql")
  .load()
scala> val lookupDF = spark.read.format("jdbc").option("url",
"jdbc:db2://xxxx.com:32051/BIGSQL").option("dbtable","bigsql.sls_product_line_lookup
").option("user", "bigsql").option("password", "bigsql").load()
lookupDF: org.apache.spark.sql.DataFrame = [PRODUCT_LINE_CODE: int, PRODUCT_LINE_EN:
string ... 22 more fields]
As you’ll note, connection property values are hard-coded in this example to make it easy for you
to follow. For production use, you may prefer a more flexible approach through the use of
java.util.Properties; see the Spark API documentation for details.
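As a sketch of that Properties-based approach (the host, port, and credential values below are placeholders you would replace for your environment), the connection details can be gathered once and passed to the three-argument spark.read.jdbc() method:

```scala
import java.util.Properties

// Placeholder credentials; in production, load these from a secured
// configuration source rather than hard-coding them.
val connProps = new Properties()
connProps.setProperty("user", "bigsql")
connProps.setProperty("password", "bigsql")

// Equivalent to the option() chain above; the same Properties object
// can be reused for every Big SQL table you read.
val lookupDF = spark.read.jdbc(
  "jdbc:db2://yourhost:32051/BIGSQL",    // placeholder host and port
  "bigsql.sls_product_line_lookup",
  connProps)
```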
__4. Display the contents of your DataFrame. The number of columns in your data set, coupled with
the length of many of the string columns, may make the output format of this show() operation
difficult to read on your screen.
lookupDF.show()
scala> lookupDF.show()
+-----------------+--------------------+--------------------+--------------------+----------
-----+--------------------+-------------------+--------------------+--------------------+---
-----------------+--------------------+--------------------+--------------------+-----------
----+--------------------+--------------------+--------------------+--------------------+---
-----------------+--------------------+---------------+--------------------+---------------
+--------------------+
|PRODUCT_LINE_CODE| PRODUCT_LINE_EN| PRODUCT_LINE_DE|
PRODUCT_LINE_FR|PRODUCT_LINE_JA| PRODUCT_LINE_CS| PRODUCT_LINE_DA|
PRODUCT_LINE_EL| PRODUCT_LINE_ES| PRODUCT_LINE_FI| PRODUCT_LINE_HU|
PRODUCT_LINE_ID| PRODUCT_LINE_IT|PRODUCT_LINE_KO| PRODUCT_LINE_MS|
PRODUCT_LINE_NL| PRODUCT_LINE_NO| PRODUCT_LINE_PL| PRODUCT_LINE_PT|
PRODUCT_LINE_RU|PRODUCT_LINE_SC| PRODUCT_LINE_SV|PRODUCT_LINE_TC| PRODUCT_LINE_TH|
+-----------------+--------------------+--------------------+--------------------+----------
-----+--------------------+-------------------+--------------------+--------------------+---
-----------------+--------------------+--------------------+--------------------+-----------
----+--------------------+--------------------+--------------------+--------------------+---
-----------------+--------------------+---------------+--------------------+---------------
+--------------------+
| 991| Camping Equipment| Campingausrüstung| Matériel de camping|
キャンプ用品|Vybavení pro kemp...| Campingudstyr|Εξοπλισμός κατασκ...| Equipo de
acampada| Retkeilyvarusteet| Kempingfelszerelés|Perlengkapan Berk...|Attrezzatura per
...| 캠핑 장비|Kelengkapan Berkh...|Kampeerbenodigdheden|
Campingutstyr|Ekwipunek kempingowy|Equipamento acamp...|Снаряжение для ту...|
露营装备| Campingutrustning| 露營器材| อุปกรณ์ตั้งแคมป์ |
| 992|Mountaineering Eq...|Bergsteigerausrüs...|Matériel de montagne|
登山用品|Horolezecké vybavení| Alpint udstyr|Εξοπλισμός ορειβα...|Equipo de
montañismo|Vuorikiipeilyvaru...|Hegymászó-felszer...|Perlengkapan Pend...|Attrezzatura per
...| 등산 장비|Kelengkapan Menda...| Bergsportartikelen| Klatreutstyr|
Sprzęt wspinaczkowy|Equipamento monta...| Горное снаряжение| 登山装备|
Klätterutrustning| 登山器材| อุปกรณ์ปีนเขา|
| 993|Personal Accessories| Accessoires|Accessoires perso...|
個人装備| Věci osobní potřeby|Personligt tilbehør| Προσωπικά είδη|Accesorios
person...|Henkilökohtaiset ...|Személyes kiegész...| Aksesori pribadi| Accessori
personali| 개인 용품| Aksesori Diri|Persoonlijke acce...|Personlig utrustning|
Akcesoria osobiste| Acessórios pessoais|Личные принадлежн...| 个人附件|Personliga
tillbehör| 個人配件| อุปกรณ์ส่วนตัว|
| 994| Outdoor Protection|Outdoor-Schutzaus...|Articles de prote...|
アウトドア用保護用品| Vybavení do přírody|Udendørsbeskyttelse|Προστασία για την...|Protección
aire l...| Ulkoiluvarusteet| Védőfelszerelés|Perlindungan Luar...|Protezione
personale| 야외 보호 장비|Perlindungan Luar...|Buitensport - pre...|Utendørs
beskyttelse|Wyposażenie ochronne| Proteção ar livre| Средства защиты|
户外防护用品| Skyddsartiklar| 戶外防護器材|สิ่งป้องกันเมื่ออ...|
| 995| Golf Equipment| Golfausrüstung| Matériel de golf|
ゴルフ用品| Golfové potřeby| Golfudstyr| Εξοπλισμός γκολφ| Equipo de
golf| Golf-varusteet| Golffelszerelés| Perlengkapan Golf|Attrezzatura da golf|
골프 장비| Kelengkapan Golf| Golfartikelen| Golfutstyr| Ekwipunek
golfowy| Equipamento golfe|Снаряжение для го...| 高尔夫球装备| Golfutrustning|
高爾夫球器材| อุปกรณ์กอล์ฟ|
+-----------------+--------------------+--------------------+--------------------+----------
-----+--------------------+-------------------+--------------------+--------------------+---
-----------------+--------------------+--------------------+--------------------+-----------
----+--------------------+--------------------+--------------------+--------------------+---
-----------------+--------------------+---------------+--------------------+---------------
+--------------------+
__5. If desired, experiment with other operations you can perform on DataFrames to inspect their
contents, consulting the Spark online documentation as needed. The example below collects
each row in the DataFrame into an Array and prints the contents of each array element.
lookupDF.collect().foreach(println)
scala> lookupDF.collect().foreach(println)
[991,Camping Equipment,Campingausrüstung,Matériel de camping,キャンプ用品,Vybavení
pro kempování,Campingudstyr,Εξοπλισμός κατασκήνωσης,Equipo de
acampada,Retkeilyvarusteet,Kempingfelszerelés,Perlengkapan Berkemah,Attrezzatura per
campeggio,캠핑 장비,Kelengkapan
Berkhemah,Kampeerbenodigdheden,Campingutstyr,Ekwipunek kempingowy,Equipamento
acampamento,Снаряжение для туризма,露营装备,Campingutrustning,露營器材,อุปกรณ์ตั้งแคมป์ ]
[992,Mountaineering Equipment,Bergsteigerausrüstung,Matériel de
montagne,登山用品,Horolezecké vybavení,Alpint udstyr,Εξοπλισμός ορειβασίας,Equipo de
montañismo,Vuorikiipeilyvarusteet,Hegymászó-felszerelés,Perlengkapan Pendaki
Gunung,Attrezzatura per alpinismo,등산 장비,Kelengkapan Mendaki
Gunung,Bergsportartikelen,Klatreutstyr,Sprzęt wspinaczkowy,Equipamento
montanhismo,Горное снаряжение,登山装备,Klätterutrustning,登山器材,อุปกรณ์ปีนเขา]
[993,Personal Accessories,Accessoires,Accessoires personnels,個人装備,Věci osobní
potřeby,Personligt tilbehør,Προσωπικά είδη,Accesorios personales,Henkilökohtaiset
tarvikkeet,Személyes kiegészítők,Aksesori pribadi,Accessori personali,개인
용품,Aksesori Diri,Persoonlijke accessoires,Personlig utrustning,Akcesoria
osobiste,Acessórios pessoais,Личные принадлежности,个人附件,Personliga
tillbehör,個人配件,อุปกรณ์ส่วนตัว]
[994,Outdoor Protection,Outdoor-Schutzausrüstung,Articles de
protection,アウトドア用保護用品,Vybavení do přírody,Udendørsbeskyttelse,Προστασία για
την ύπαιθρο,Protección aire libre,Ulkoiluvarusteet,Védőfelszerelés,Perlindungan Luar
Ruang,Protezione personale,야외 보호 장비,Perlindungan Luar Bangunan,Buitensport -
preventie,Utendørs beskyttelse,Wyposażenie ochronne,Proteção ar livre,Средства
защиты,户外防护用品,Skyddsartiklar,戶外防護器材,สิ่งป้องกันเมื่ออยู่กลางแจ ้ง]
[995,Golf Equipment,Golfausrüstung,Matériel de golf,ゴルフ用品,Golfové
potřeby,Golfudstyr,Εξοπλισμός γκολφ,Equipo de golf,Golf-
varusteet,Golffelszerelés,Perlengkapan Golf,Attrezzatura da golf,골프
장비,Kelengkapan Golf,Golfartikelen,Golfutstyr,Ekwipunek golfowy,Equipamento
golfe,Снаряжение для гольфа,高尔夫球装备,Golfutrustning,高爾夫球器材,อุปกรณ์กอล์ฟ]
__6. To query the DataFrame, create a temporary view of it.
lookupDF.createOrReplaceTempView("lookup")
__7. Use Spark SQL to query your temporary view and display the results. The following example
returns English and French product line information for all product line codes below 995.
spark.sql("select product_line_code, product_line_en, product_line_fr from lookup
where product_line_code < 995").show()
scala> spark.sql("select product_line_code, product_line_en, product_line_fr from
lookup where product_line_code < 995").show()
+-----------------+--------------------+--------------------+
|product_line_code| product_line_en| product_line_fr|
+-----------------+--------------------+--------------------+
| 991| Camping Equipment| Matériel de camping|
| 992|Mountaineering Eq...|Matériel de montagne|
| 993|Personal Accessories|Accessoires perso...|
| 994| Outdoor Protection|Articles de prote...|
+-----------------+--------------------+--------------------+
Note that string values in the last two columns of this result were truncated due to defaults
associated with show().
__8. If desired, modify the previous command slightly to display more content in the final two columns
of your result. For example, use show(5,100) to specify that a maximum of 5 rows should be
returned and that string column values should be truncated after 100 characters.
spark.sql("select product_line_code, product_line_en, product_line_fr from lookup
where product_line_code < 995").show(5,100)
scala> spark.sql("select product_line_code, product_line_en, product_line_fr from
lookup where product_line_code < 995").show(5,100)
+-----------------+------------------------+----------------------+
|product_line_code| product_line_en| product_line_fr|
+-----------------+------------------------+----------------------+
| 991| Camping Equipment| Matériel de camping|
| 992|Mountaineering Equipment| Matériel de montagne|
| 993| Personal Accessories|Accessoires personnels|
| 994| Outdoor Protection|Articles de protection|
+-----------------+------------------------+----------------------+
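In the queries above, Spark fetches the entire table over JDBC and applies the WHERE clause itself. Spark's JDBC source also accepts a parenthesized, aliased subquery as the dbtable option, which pushes the filtering down to Big SQL. A sketch, again with placeholder connection values:

```scala
// The subquery must be wrapped in parentheses and given an alias;
// Big SQL evaluates it, so only matching rows cross the connection.
val pushdownQuery =
  """(select product_line_code, product_line_en, product_line_fr
    |  from bigsql.sls_product_line_lookup
    |  where product_line_code < 995) as lookup_sub""".stripMargin

val filteredDF = spark.read.format("jdbc")
  .option("url", "jdbc:db2://yourhost:32051/BIGSQL")   // placeholder
  .option("dbtable", pushdownQuery)
  .option("user", "bigsql")
  .option("password", "bigsql")
  .load()
```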
2.8. Join data from multiple tables
In this module, you’ll learn how to use Spark to query data from Big SQL tables managed outside of the
Hive warehouse. In doing so, you’ll see that the underlying storage mechanism that Big SQL uses for its
tables is hidden from the Spark programmer. In other words, you don’t need to know how Big SQL is
storing the data – you simply query Big SQL tables as you would any other JDBC data source.
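Because the storage manager is invisible at this level, the read pattern from the previous module carries over unchanged; only the table name differs. A small helper, sketched here with the same placeholder connection values used earlier, makes that explicit:

```scala
// One helper covers Hive-managed, HBase-managed, and externally managed
// Big SQL tables alike. Adjust url, user, and password for your cluster.
def bigSqlTable(name: String) =
  spark.read.format("jdbc")
    .option("url", "jdbc:db2://yourhost:32051/BIGSQL")
    .option("dbtable", name)
    .option("user", "bigsql")
    .option("password", "bigsql")
    .load()

val dimDF  = bigSqlTable("extern.sls_product_dim")    // HDFS directory
val flatDF = bigSqlTable("bigsql.sls_product_flat")   // HBase
```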
__9. Following the same approach that you used in the previous module, create a DataFrame for
your Big SQL extern.sls_product_dim table. Adjust the specifications (option values)
below as needed to match your environment.
val dimDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","extern.sls_product_dim").option("user", "bigsql").option("password", "bigsql").load()
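Every JDBC read in this lab repeats the same URL and credentials, varying only the table name. As an optional convenience (not part of the original lab steps), you could wrap that boilerplate in a small helper; the host, port, user, and password below are the placeholder values used throughout this lab and must be adjusted for your environment.

```scala
// Hypothetical helper (not part of the lab script): read any Big SQL table
// over JDBC. Host, port, user, and password are placeholders.
def bigsqlTable(table: String): org.apache.spark.sql.DataFrame =
  spark.read.format("jdbc")
    .option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL")
    .option("dbtable", table)
    .option("user", "bigsql")
    .option("password", "bigsql")
    .load()

// Usage, equivalent to the dimDF definition above:
// val dimDF = bigsqlTable("extern.sls_product_dim")
```

With such a helper, each subsequent DataFrame in this lab could be created with a single short call.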
__10. Create a temporary view so that you can query the data.
dimDF.createOrReplaceTempView("dim")
__11. Query the data set, making sure that there are 274 rows.
spark.sql("select count(*) from dim").show()
scala> spark.sql("select count(*) from dim").show()
+--------+
|count(1)|
+--------+
| 274|
+--------+
__12. Join data in your lookup and dim views and display the results. Recall that lookup loaded data
from a Big SQL Hive table while dim loaded data from a Big SQL externally managed table.
spark.sql("select l.product_line_code, l.product_line_en, d.product_number, d.introduction_date from lookup l, dim d where l.product_line_code=d.product_line_code limit 15").show()
scala> spark.sql("select l.product_line_code, l.product_line_en, d.product_number, d.introduction_date from lookup l, dim d where l.product_line_code=d.product_line_code limit 15").show()
+-----------------+---------------+--------------+--------------------+
|product_line_code|product_line_en|product_number| introduction_date|
+-----------------+---------------+--------------+--------------------+
| 995| Golf Equipment| 101110|2003-12-15 00:00:...|
| 995| Golf Equipment| 102110|2003-12-10 00:00:...|
| 995| Golf Equipment| 103110|2003-12-10 00:00:...|
| 995| Golf Equipment| 104110|2003-12-18 00:00:...|
| 995| Golf Equipment| 105110|2003-12-27 00:00:...|
| 995| Golf Equipment| 106110|2003-12-05 00:00:...|
| 995| Golf Equipment| 107110|2004-01-13 00:00:...|
| 995| Golf Equipment| 108110|2003-12-27 00:00:...|
| 995| Golf Equipment| 109110|2003-12-10 00:00:...|
| 995| Golf Equipment| 110110|2003-12-10 00:00:...|
| 995| Golf Equipment| 111110|2003-12-15 00:00:...|
| 995| Golf Equipment| 112110|2004-01-10 00:00:...|
| 995| Golf Equipment| 113110|2004-01-15 00:00:...|
| 995| Golf Equipment| 114110|2003-12-15 00:00:...|
| 995| Golf Equipment| 115110|2003-12-27 00:00:...|
+-----------------+---------------+--------------+--------------------+
__13. Next, explore how to query a Big SQL HBase table. Create a DataFrame for your
sls_product_flat table. Adjust the specification (option values) below as needed to
match your environment.
val hbaseDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.sls_product_flat").option("user", "bigsql").option("password", "bigsql").load()
__14. Create a temporary view of your DataFrame.
hbaseDF.createOrReplaceTempView("flat")
__15. Query this view.
spark.sql("select * from flat where product_key > 30270").show()
scala> spark.sql("select * from flat where product_key > 30270").show()
+-----------+-----------------+----------------+-----------------+--------------------+---------------+
|PRODUCT_KEY|PRODUCT_LINE_CODE|PRODUCT_TYPE_KEY|PRODUCT_TYPE_CODE| PRODUCT_LINE_EN|PRODUCT_LINE_DE|
+-----------+-----------------+----------------+-----------------+--------------------+---------------+
| 30271| 993| 960| 960|Personal Accessories| Accessoires|
| 30272| 993| 960| 960|Personal Accessories| Accessoires|
| 30273| 993| 960| 960|Personal Accessories| Accessoires|
| 30274| 993| 960| 960|Personal Accessories| Accessoires|
+-----------+-----------------+----------------+-----------------+--------------------+---------------+
__16. To prepare to join data from your Big SQL HBase table with data from a Big SQL Hive table,
create a DataFrame for the sales fact table. Adjust the specification (option values) below
as needed to match your environment.
val factDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.sls_sales_fact").option("user", "bigsql").option("password", "bigsql").load()
__17. Create a view of this DataFrame.
factDF.createOrReplaceTempView("fact")
__18. Join data in your temporary fact and flat views. These views are based on the Big SQL
sales fact table (sls_sales_fact) managed by Hive and the Big SQL dimension table
(sls_product_flat) managed by HBase.
spark.sql("select f.retailer_key, f.product_key, h.product_line_en, f.sale_total from fact f, flat h where f.product_key = h.product_key and sale_total > 50000 limit 10").show()
scala> spark.sql("select f.retailer_key, f.product_key, h.product_line_en, f.sale_total from fact f, flat h where f.product_key = h.product_key and sale_total > 50000 limit 10").show()
+------------+-----------+---------------+----------+
|retailer_key|product_key|product_line_en|sale_total|
+------------+-----------+---------------+----------+
| 6870| 30128| Golf Equipment| 60812.38|
| 6875| 30128| Golf Equipment| 79123.29|
| 6872| 30128| Golf Equipment| 61316.35|
| 6875| 30128| Golf Equipment| 79291.28|
| 6875| 30128| Golf Equipment| 77275.4|
| 6875| 30128| Golf Equipment| 65348.11|
| 6871| 30128| Golf Equipment| 51740.92|
| 6875| 30128| Golf Equipment| 71059.77|
| 6875| 30128| Golf Equipment| 79627.26|
| 7154| 30128| Golf Equipment| 58460.52|
+------------+-----------+---------------+----------+
2.9. Use Spark MLlib to work with Big SQL data
Now that you understand how to use Spark SQL to work with Big SQL tables, let’s explore how to use other
Spark technologies to manipulate Big SQL data. In this exercise, you’ll use a simple function in Spark’s
machine learning library (MLlib) to transform a DataFrame built from data in one or more Big SQL tables.
Quite often, transformations are needed before using more sophisticated analytical functions available
through MLlib and other libraries.
__19. Import Spark MLlib classes that you’ll be using shortly.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
__20. Create a DataFrame for new query results that involve a Spark SQL temporary view you created
earlier for the Big SQL sales fact table. Your new DataFrame will serve as input to a subsequent
transformation operation.
val test = sql("select quantity, unit_cost, sale_total, gross_profit from fact")
__21. Create a VectorAssembler that combines data in a list of columns into a single vector column. In
this example, two columns (quantity and unit_cost) will be transformed into a new vector
column named data.
val assembler = new VectorAssembler().setInputCols(Array("quantity","unit_cost")).setOutputCol("data")
__22. Execute the transformation operation and display the results. Note that this form of show()
displays only the top 20 records of the results.
val output = assembler.transform(test)
output.show()
scala> assembler.transform(test).show()
+--------+------------------+----------+------------+--------------------+
|quantity| unit_cost|sale_total|gross_profit| data|
+--------+------------------+----------+------------+--------------------+
| 587| 34.9| 41958.76| 21472.46| [587.0,34.9]|
| 214| 91.8| 35949.86| 16304.66| [214.0,91.8]|
| 576| 2.67| 5921.28| 4383.36| [576.0,2.67]|
| 129| 56.23| 11570.01| 4316.34| [129.0,56.23]|
| 1776|1.8599999999999999| 10123.2| 6819.84|[1776.0,1.8599999...|
| 1822| 1.79| 8654.5| 5393.12| [1822.0,1.79]|
| 412| 9.1| 9191.72| 5442.52| [412.0,9.1]|
| 67| 690.0| 79200.7| 32970.7| [67.0,690.0]|
| 97| 238.88| 41543.16| 18371.8| [97.0,238.88]|
| 1172| 6.62| 10278.44| 2519.8| [1172.0,6.62]|
| 591| 34.97| 30838.38| 10171.11| [591.0,34.97]|
| 338| 85.11| 40776.32| 12009.14| [338.0,85.11]|
| 97| 426.0| 61075.08| 19753.08| [97.0,426.0]|
| 364| 86.0| 49704.2| 18400.2| [364.0,86.0]|
| 234| 65.25| 22737.78| 7469.28| [234.0,65.25]|
| 603| 16.0| 19446.75| 9798.75| [603.0,16.0]|
| 232| 18.0| 6625.92| 2449.92| [232.0,18.0]|
| 450| 18.05| 15075.0| 6952.5| [450.0,18.05]|
| 257| 20.0| 7864.2| 2724.2| [257.0,20.0]|
| 191| 15.62| 6434.79| 3451.37| [191.0,15.62]|
+--------+------------------+----------+------------+--------------------+
only showing top 20 rows
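Once features are assembled into a vector column, that column can feed MLlib estimators directly. As a brief sketch that goes one step beyond this lab (the algorithm choice here is illustrative, not prescribed by the exercise), you could cluster sales records by quantity and unit cost with k-means, pointing the estimator at the data column created above:

```scala
import org.apache.spark.ml.clustering.KMeans

// Illustrative only: group sales fact rows into 3 clusters based on the
// assembled [quantity, unit_cost] vector column named "data".
val features = assembler.transform(test)
val kmeans = new KMeans().setFeaturesCol("data").setK(3).setSeed(1L)
val model = kmeans.fit(features)
model.clusterCenters.foreach(println)   // prints one vector per cluster center
```

The same pattern applies to any MLlib estimator that expects a vector-valued features column.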
2.10. Optional: Join Big SQL and JSON data
In this module, you’ll explore how to use Spark to query JSON data and join that data with data from a
Big SQL table. Spark SQL enables programmers to query data in a variety of popular formats without
manually defining tables or mapping schemas to tables. Since JSON is a very popular data format, you
may find it convenient to use Spark SQL’s built-in JSON support to query data in JSON files. And since
Big SQL is a popular target for managing “cold” corporate data extracted from a relational data
warehouse, you may occasionally need to join JSON and Big SQL data.
To begin, collect some JSON data. Examples in this lab are based on 10-day weather forecasts
generated by The Weather Company’s limited-use free service on Bluemix. For details about this
service, log into Bluemix at http://bluemix.net and visit
https://console.ng.bluemix.net/catalog/services/weather-company-data?taxonomyNavigation=apps. If
using the Bluemix weather service to generate JSON data for your work, follow the instructions below. If
you wish to use different JSON data, skip the remainder of this section.
__1. Log into Bluemix (or register for a free account, if needed).
__2. Search the catalog for “weather” services.
__3. Once you locate The Weather Company’s service, follow the standard Bluemix procedure to
create an instance of this service. Consult Bluemix online documentation, if needed, to perform
this task.
__4. After creating your weather service, access its APIs. Consult Bluemix online documentation, if
needed, to perform this task.
__5. From the listing of weather service APIs, select the Daily Forecast API.
__6. Click the service for the 10-day daily forecast by geocode.
__7. Scroll through the displayed pages to become familiar with the details of this forecast service.
Note that you can customize input parameters to control the location of the forecast, the units of
measure for the data (e.g., metric) and the language of text data (e.g., English). Accept all
default values and proceed to the bottom of the page. Locate and click the Try it out! button.
__8. When prompted for a username and password, enter the information supplied for your service’s
credentials.
The appropriate user name and password are included in the service credentials section of your
service’s main page. Do not enter your Bluemix ID and password; these will be rejected.
You must enter the username and password that were generated for your service when it was
created. If necessary, return to the main menu for your service and click on the Services
Credentials link to expose this information. Consult Bluemix documentation for details.
__9. Inspect the results; a subset is shown here:
__10. Review the structure of the JSON data returned by this service, noting that it contains multiple
levels of nesting. Top-level objects represent metadata (such as the language used in
subsequent forecasts, the longitude and latitude of the location where the forecasts apply, etc.) and weather
forecasts. Forecasts contain an array of JSON objects that detail the minimum and maximum
temperatures, the local time for which the forecast is valid, etc. Also included in each forecast
array element are separate night and day forecasts, which contain further details. Keep the
structure of this JSON data in mind, as it dictates the syntax of your queries in subsequent
exercises.
__11. Note that the weather data returned in the Response Body section of this web page splits
information across multiple lines. You need to store the data without carriage returns or line
feeds. To do so, copy the URL displayed in the Request URL section of the web page and paste
it into a new tab on your browser. (Writing an application that calls the REST APIs is more
appropriate for production. However, I wanted to give you a quick way to collect some data.)
__12. Inspect the results, noting the change in displayed format.
__13. Copy the contents of this data and paste it into a file on your local file system.
__14. Optionally, repeat the process for another day or alter the geocode input parameter (longitude,
latitude) to collect data about a different location. Store the results in a different file so that you
will have at least 2 different 10-day forecasts to store in BigInsights. The weather API service
includes a Parameters section that you can use to alter the geocode.
__15. Use FTP or SFTP to transfer your weather data file(s) to a local file system for your BigInsights
cluster.
__16. Open a terminal window for your BigInsights cluster.
Issue an HDFS shell command to create a subdirectory within HDFS for test purposes. I created
a /user/saracco/weather subdirectory, as shown below. Modify the commands as needed
for your environment.
hdfs dfs -mkdir /user/saracco
hdfs dfs -mkdir /user/saracco/weather
__17. Copy the file(s) from your local directory to your new HDFS subdirectory. Adjust this command
as needed for your environment:
hdfs dfs -copyFromLocal weather*.* /user/saracco/weather
__18. Change permissions on your subdirectory and its contents. For example:
hdfs dfs -chmod -R 777 /user/saracco/weather
__19. List the contents of your HDFS subdirectory to validate your work, verifying that the files you
uploaded are present and that permissions provide for global access.
hdfs dfs -ls /user/saracco/weather
Found 2 items
-rwxrwxrwx 3 saracco bihdfs 30677 2017-04-10 20:56 /user/saracco/weather/
weather10dayApr4.json
-rwxrwxrwx 3 saracco bihdfs 29655 2017-04-10 20:56 /user/saracco/weather/
weather10dayFeb17.json
After you’ve collected some JSON data and uploaded it to HDFS, create and populate a suitable Big
SQL table to be joined with the JSON data. In this section, you’ll create a geolookup table that contains
longitude, latitude, and location information. The longitude and latitude columns will serve as join keys in
a subsequent exercise. If you plan to use different JSON data for your work, modify the table definition
and INSERT statements as needed to create a suitable table.
__20. Launch JSqsh and connect to your Big SQL database.
__21. Create a Big SQL table.
create hadoop table geolookup(longitude decimal(10,7), latitude decimal(10,7),
location varchar(30));
__22. Insert rows into your Big SQL table. For join key columns, ensure that at least some rows
contain data that will match your JSON data. For example, if you’re using weather forecasts
generated by the Bluemix service, inspect the contents of your JSON weather data to determine
appropriate data values for longitude and latitude, taking care to include the proper precision.
This example inserts 3 rows into the geolookup table; 2 of these rows contain longitude / latitude
data that matches JSON weather forecast data I collected earlier.
insert into geolookup values (84.5, 37.17, 'Xinjiang, China');
insert into geolookup values (-121.75, 37.17, 'San Jose, CA USA');
insert into geolookup values (-73.990246,40.730171,'IBM Astor Place, NYC USA');
__23. Exit JSqsh.
Now that you have sample JSON data in HDFS and a suitable Big SQL table, it’s time to query your
data. Let’s start with the JSON data. If you’re not using 10-day weather forecasts for your sample JSON
data, you’ll need to modify some of the instructions below to match your data.
__24. If you don’t already have an open window with the Spark shell launched, launch it now.
Remember to include the appropriate Big SQL JDBC driver information at launch. Adjust the
specification below as needed to match your environment.
spark-shell --driver-class-path /usr/ibmpacks/bigsql/4.3.0.0/db2/java/db2jcc4.jar
__25. From the Spark shell, define a path to point to your JSON data set. Adjust the specification
below as needed to match your environment.
val path = "/user/saracco/weather"
scala> val path = "/user/saracco/weather"
path: String = /user/saracco/weather
__26. Define a new DataFrame that will read JSON data from this path.
val weatherDF = spark.read.json(path)
scala> val weatherDF = spark.read.json(path)
17/04/03 11:40:59 WARN Utils: Truncated the string representation of a plan since it
was too large. This behavior can be adjusted by setting
'spark.debug.maxToStringFields' in SparkEnv.conf.
weatherDF: org.apache.spark.sql.DataFrame = [forecasts:
array<struct<blurb:string,blurb_author:string,class:string,day:struct<accumulation_p
hrase:string,alt_daypart_name:string,clds:bigint,day_ind:string,daypart_name:string,
fcst_valid:bigint,fcst_valid_local:string,golf_category:string,golf_index:bigint,hi:
bigint,icon_code:bigint,icon_extd:bigint,log_daypart_name:string,narrative:string,nu
m:bigint,phrase_12char:string,phrase_22char:string,phrase_32char:string,pop:bigint,p
op_phrase:string,precip_type:string,qpf:double,qualifier:string,qualifier_code:strin
g,... 24 more
fields>,dow:string,expire_time_gmt:bigint,fcst_valid:bigint,fcst_valid_local:string,
lunar_phase:string,lunar_phase_code:string,lunar_phase_day:bigint,max_temp:bigint,mi
n_temp:bigint,moonrise:string,moonset:string,narrative...
__27. Optionally, print the schema of your JSON data set so you can visualize its structure. (The
screen capture below displays only a portion of the output.)
weatherDF.printSchema()
scala> weatherDF.printSchema()
root
|-- forecasts: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- blurb: string (nullable = true)
| | |-- blurb_author: string (nullable = true)
| | |-- class: string (nullable = true)
| | |-- day: struct (nullable = true)
| | | |-- accumulation_phrase: string (nullable = true)
| | | |-- alt_daypart_name: string (nullable = true)
| | | |-- clds: long (nullable = true)
| | | |-- day_ind: string (nullable = true)
| | | |-- daypart_name: string (nullable = true)
. . . .
__28. Create a temporary view of this data.
weatherDF.createOrReplaceTempView("weather")
__29. Create a new DataFrame to hold the results of a query over this view. The example below
extracts the transaction ID from the metadata associated with each JSON record. (My test
scenario contains 2 JSON records.)
val idDF = spark.sql("select metadata.transaction_id from weather")
Apart from the JSON path expressions that specify the data of interest within each record,
this query is no different from any other Spark SQL query.
__30. Display the results.
idDF.show(2,100)
scala> idDF.show(2,100)
+------------------------+
| transaction_id|
+------------------------+
|1459805077803:1340092112|
| 1455739258736:810296662|
+------------------------+
__31. Execute a more selective query that accesses deeply nested JSON objects. This example
retrieves the longitude and latitude, the valid date/time for the night’s forecast for the first
element in the forecast array (forecasts[0].night.fcst_valid_local), and the short description of that
night’s forecast (forecasts[0].night.shortcast).
sql("select metadata.longitude, metadata.latitude, forecasts[0].night.fcst_valid_local, forecasts[0].night.shortcast from weather").show()
scala> sql("select metadata.longitude, metadata.latitude, forecasts[0].night.fcst_valid_local, forecasts[0].night.shortcast from weather").show()
+---------+--------+-----------------------------------+----------------------------+
|longitude|latitude|forecasts[0].night.fcst_valid_local|forecasts[0].night.shortcast|
+---------+--------+-----------------------------------+----------------------------+
| -121.75| 37.17| 2016-04-04T19:00:...| Mainly clear|
| 84.5| 37.17| 2016-02-17T19:00:...| Partly cloudy|
+---------+--------+-----------------------------------+----------------------------+
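The query above reads only the first element of the forecasts array. If you want one row per forecast day, Spark SQL can flatten the array with explode (via HiveQL's LATERAL VIEW syntax). This sketch assumes the weather view created earlier and is not part of the original lab steps:

```scala
// Flatten the 10-day forecasts array: one output row per forecast element.
// "t" is the lateral view alias; "f" names each exploded array element.
spark.sql("""
  select metadata.longitude, metadata.latitude,
         f.night.fcst_valid_local, f.night.shortcast
  from weather
  lateral view explode(forecasts) t as f
""").show(5, 50)
```

With two 10-day forecast files in HDFS, you would expect this query to produce about 20 rows in total (the show(5, 50) call displays only the first 5).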
Next, query your Big SQL data using Spark. This approach should be very familiar to you by now.
__32. Create a DataFrame for the Big SQL table (geolookup) that you intend to join with the JSON
data.
val geoDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.geolookup").option("user", "bigsql").option("password", "bigsql").load()
__33. Create a temporary view.
geoDF.createOrReplaceTempView("geolookup")
__34. Verify that you can query this view. Given how I populated the geolookup table earlier, the
correct result from this query should be 3.
spark.sql("select count(*) from geolookup").show()
__35. Optionally, display the contents of geolookup.
spark.sql("select * from geolookup").show(5,100)
scala> spark.sql("select * from geolookup").show(5,100)
+------------+----------+------------------------+
| LONGITUDE| LATITUDE| LOCATION|
+------------+----------+------------------------+
|-121.7500000|37.1700000| San Jose, CA USA|
| -73.9902460|40.7301710|IBM Astor Place, NYC USA|
| 84.5000000|37.1700000| Xinjiang, China|
+------------+----------+------------------------+
__36. Join the JSON weather forecast data with your Big SQL data. The first 3 columns of this query’s
result set are derived from the JSON data, while the final column’s data is derived from the Big
SQL table. As you’ll note, longitude and latitude data in both source data sets serve as the join
keys.
spark.sql("select w.metadata.longitude, w.metadata.latitude, w.forecasts[0].narrative, g.location from weather w, geolookup g where w.metadata.longitude=g.longitude and w.metadata.latitude=g.latitude").show()
scala> spark.sql("select w.metadata.longitude, w.metadata.latitude, w.forecasts[0].narrative, g.location from weather w, geolookup g where w.metadata.longitude=g.longitude and w.metadata.latitude=g.latitude").show()
+---------+--------+----------------------+----------------+
|longitude|latitude|forecasts[0].narrative| location|
+---------+--------+----------------------+----------------+
| 84.5| 37.17| Partly cloudy. Lo...| Xinjiang, China|
| -121.75| 37.17| Abundant sunshine...|San Jose, CA USA|
+---------+--------+----------------------+----------------+
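If you want to persist a result like this, Spark can also write DataFrames over JDBC. The sketch below is an illustration under stated assumptions rather than a tested lab step: the target table name weather_geo is hypothetical, and depending on your Big SQL version you may need to create the target table yourself before appending to it.

```scala
// Hypothetical sketch: write the joined result back to Big SQL via JDBC.
// Table name, host, port, and credentials are placeholders.
val joined = spark.sql("select w.metadata.longitude, w.metadata.latitude, " +
  "g.location from weather w, geolookup g " +
  "where w.metadata.longitude=g.longitude and w.metadata.latitude=g.latitude")

joined.write.format("jdbc")
  .option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL")
  .option("dbtable", "bigsql.weather_geo")   // hypothetical target table
  .option("user", "bigsql")
  .option("password", "bigsql")
  .mode("append")
  .save()
```

The "append" save mode adds rows to an existing table; other modes (such as "overwrite") behave differently and should be tested carefully against your environment.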
Lab 3 Summary
In this lab, you explored one way of using Spark to work with data in Big SQL tables stored in the Hive
warehouse, in HBase and in an arbitrary HDFS directory. Through Spark SQL’s support for JDBC data
sources, you queried these tables and even invoked a simple Spark MLlib transformative operation
against data in one of these tables. Finally, if you completed the optional exercise, you saw how easy it
can be to join data in JSON files with data in Big SQL tables using Spark SQL.
To expand your skills and learn more, enroll in free online courses offered by Big Data University
(http://www.bigdatauniversity.com/) or work through free tutorials included in the BigInsights product
documentation. The HadoopDev web site (https://developer.ibm.com/hadoop/) contains links to these
and other resources.