Using Spark with Big SQL
Cynthia M. Saracco
IBM Solution Architect
April 12, 2017
Contents
LAB 1 OVERVIEW
   1.1. WHAT YOU'LL LEARN
   1.2. PRE-REQUISITES
   1.3. ABOUT YOUR ENVIRONMENT
   1.4. GETTING STARTED
LAB 2 USING SPARK TO WORK WITH BIG SQL TABLES
   2.1. CREATING AND POPULATING BIG SQL SAMPLE TABLES
   2.2. LAUNCH JSQSH AND CONNECT TO YOUR DATABASE
   2.3. CREATE BIG SQL HIVE TABLES
   2.4. CREATE A BIG SQL EXTERNALLY MANAGED TABLE
   2.5. CREATE A BIG SQL TABLE IN HBASE
   2.6. QUERYING AND MANIPULATING BIG SQL DATA THROUGH SPARK
   2.7. EXPLORE THE BASICS
   2.8. JOIN DATA FROM MULTIPLE TABLES
   2.9. USE SPARK MLLIB TO WORK WITH BIG SQL DATA
   2.10. OPTIONAL: JOIN BIG SQL AND JSON DATA
LAB 3 SUMMARY
Lab 1 Overview
This hands-on lab helps you explore how Spark programmers can work with data managed by Big SQL.
Big SQL is a high-performance query engine for IBM BigInsights and the Hortonworks Data Platform.
Apache Spark, part of IBM’s Open Platform for Apache Hadoop, is a fast, general-purpose engine for
processing Big Data, including data managed by Hadoop. Particularly appealing to many Spark
programmers are built-in and third-party libraries for machine learning, streaming, SQL, and more.
Given the popularity of both Big SQL and Spark, organizations may want to deploy and use both
technologies. This lab introduces you to one way in which you can integrate these technologies –
namely, by using Spark SQL and its support for JDBC-enabled data sources to manipulate Big SQL
tables in Hive, HBase, or arbitrary HDFS directories. It’s worth noting that other forms of integration are
possible. For example, Big SQL programmers can launch Spark jobs from within their Big SQL queries
and integrate results returned from Spark with Big SQL data. However, such integration is beyond the
scope of this lab. Consult the Big SQL product documentation for further details, if interested.
1.1. What you'll learn
After completing all exercises in this lab, you'll know how to
• Work with data in Big SQL tables using the Spark shell.
• Create and populate Spark DataFrames with data from Big SQL tables.
• Query and join data from Big SQL tables using Spark SQL.
• Use a simple Spark ML (machine learning) function to operate on Big SQL data.
• Optionally, use Spark SQL to join Big SQL data with complex JSON data in HDFS.
Allow 1 – 2 hours to complete this lab. Special thanks to Dan Kikuchi for reviewing this material.
1.2. Pre-requisites
Prior to beginning this lab, you will need access to a BigInsights 4.3 environment as described in the
subsequent section. In addition, you should be familiar with the basics of Big SQL and Spark. Labs 1 –
4 of Getting Started with Big SQL (http://www.slideshare.net/CynthiaSaracco/big-sql40-hol) and the lab
on Working with HBase and Big SQL (https://www.slideshare.net/CynthiaSaracco/h-base-introlabv4)
will help you become familiar with the fundamentals of Big SQL needed for this lab. The
Apache Spark web site (https://spark.apache.org/) contains various resources to help you become
familiar with Spark.
1.3. About your environment
This lab requires a BigInsights 4.3 environment in which Big SQL, JSqsh (a SQL command-line
interface), HBase, and Spark are installed and running. Big SQL and JSqsh are part of IBM BigInsights.
Spark is part of the IBM Open Platform for Apache Hadoop upon which BigInsights is based.
Examples in this lab were tested on a 4-node test cluster running BigInsights 4.3 technical preview 2 with
Spark 2.1; the specific configuration of this cluster is outlined in the following table. If your environment
is different, modify the sample code and instructions as needed to match your configuration.
                         User      Password
Root account             root      password
Big SQL Administrator    bigsql    bigsql
Ambari Administrator     admin     admin

Property                          Value
Host name                         myhost-master.fyre.ibm.com
Ambari port number                8080
Big SQL database name             bigsql
Big SQL port number               32051
Big SQL installation directory    /usr/ibmpacks/bigsql
Big SQL JDBC driver               /usr/ibmpacks/bigsql/4.3.0.0/db2/java/db2jcc4.jar
JSqsh installation directory      /usr/ibmpacks/common-utils/current/jsqsh
Big SQL samples directory         /usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data
Spark version                     2.1
Spark client                      /usr/iop/current/spark2-client
About the screen captures, sample code, and environment configuration
Screen captures in this lab depict examples and results that may vary from what you
see when you complete the exercises. In addition, you may need to customize some
code examples to match your environment.
1.4. Getting started
To get started with the lab exercises, you need access to a working BigInsights environment, as
described in the previous section. See https://www.ibm.com/analytics/us/en/technology/hadoop/hadoop-
trials.html for download options. Product documentation for BigInsights, including installation instructions
and sample exercises, is available at
https://www.ibm.com/support/knowledgecenter/SSPT3X_4.3.0/com.ibm.swg.im.infosphere.biginsights.welcome.doc/doc/welcome.html.
Before continuing with this lab, verify that Big SQL, Spark, and all of their pre-requisite services are
running. If you have any questions or need help getting your environment up and running, visit Hadoop
Dev (https://developer.ibm.com/hadoop/) and review the product documentation or post a message to
the forum. You cannot proceed with subsequent lab exercises without access to a working environment.
Lab 2 Using Spark to work with Big SQL tables
With your BigInsights environment running Big SQL and Spark, you’re ready to explore how to access
Big SQL data from the Spark shell. In this exercise, you will
• Create and populate a few Big SQL tables
• Use the Spark shell, Scala, and Spark SQL to query and join data in these tables
• Invoke a Spark ML (machine learning) function over Big SQL data
• Optionally, join data in JSON files with data in Big SQL tables through Spark SQL
This lab presumes you know how to launch and use Big SQL’s command-line interface (JSqsh) to
execute queries and commands. If necessary, consult the Getting Started with Big SQL lab
(http://www.slideshare.net/CynthiaSaracco/big-sql40-hol) or the BigInsights Knowledge Center
(https://www.ibm.com/support/knowledgecenter/SSPT3X_4.3.0/com.ibm.swg.im.infosphere.biginsights.welcome.doc/doc/welcome.html) for details.
This lab also presumes that you can access the Big SQL sample data on your local file system. The
sample data ships with Big SQL and is often located in a directory such as
/usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data.
2.1. Creating and populating Big SQL sample tables
In this module, you will create and populate several Big SQL tables that you will access from Spark later.
To illustrate the breadth of Big SQL data access available to Spark programmers, your Big SQL tables
will employ different underlying storage managers. Specifically, your sample tables will store data in the
Hive warehouse, in HBase, and in an arbitrary HDFS directory.
This lab uses 3 of the more than 60 tables that comprise the sample Big SQL database, which employs a
star schema design (FACT and DIMENSION tables) typical of a relational data warehouse to model
sales data for various retail products. Unless otherwise indicated, examples presume you’re executing
commands using the bigsql ID.
If you’ve worked through other publicly available Big SQL labs, you may have already completed some of
the necessary work included in this module.
2.2. Launch JSqsh and connect to your database
__1. Launch JSqsh, the Big SQL command line interface. For example, in my environment, I entered
/usr/ibmpacks/common-utils/current/jsqsh/bin/jsqsh
__2. Connect to your Big SQL database. If necessary, launch the JSqsh connection wizard to create
a connection.
setup connections
2.3. Create Big SQL Hive tables
__3. Create two tables in the Hive warehouse using your default schema (which will be "bigsql" if you
connected to your database as that user). The first table is part of the PRODUCT dimension
and includes information about product lines in different languages. The second table is the
sales FACT table, which tracks transactions (orders) of various products.
-- look up table with product line info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup
( product_line_code INT NOT NULL
, product_line_en VARCHAR(90) NOT NULL
, product_line_de VARCHAR(90), product_line_fr VARCHAR(90)
, product_line_ja VARCHAR(90), product_line_cs VARCHAR(90)
, product_line_da VARCHAR(90), product_line_el VARCHAR(90)
, product_line_es VARCHAR(90), product_line_fi VARCHAR(90)
, product_line_hu VARCHAR(90), product_line_id VARCHAR(90)
, product_line_it VARCHAR(90), product_line_ko VARCHAR(90)
, product_line_ms VARCHAR(90), product_line_nl VARCHAR(90)
, product_line_no VARCHAR(90), product_line_pl VARCHAR(90)
, product_line_pt VARCHAR(90), product_line_ru VARCHAR(90)
, product_line_sc VARCHAR(90), product_line_sv VARCHAR(90)
, product_line_tc VARCHAR(90), product_line_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- fact table for sales
CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact
( order_day_key INT NOT NULL
, organization_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, retailer_site_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, order_method_key INT NOT NULL
, sales_order_key INT NOT NULL
, ship_day_key INT NOT NULL
, close_day_key INT NOT NULL
, quantity INT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
__4. Load data into each of these tables using sample files provided with Big SQL. Change the FILE
URL specification in each of the following examples to match your environment. Then,
one at a time, issue each LOAD statement and verify that the operation completed successfully.
LOAD returns a warning message providing details on the number of rows loaded, etc.
load hadoop using file url 'sftp://yourID:yourPassword@myhost-master.fyre.ibm.com:22/usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LINE_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_LINE_LOOKUP overwrite;

load hadoop using file url 'sftp://yourID:yourPassword@myhost-master.fyre.ibm.com:22/usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data/GOSALESDW.SLS_SALES_FACT.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_SALES_FACT overwrite;
__5. Query the tables to verify that the expected number of rows was loaded into each table. Execute
each query below and compare the results with the number of rows specified in the comment
line preceding each query.
-- total rows in SLS_PRODUCT_LINE_LOOKUP = 5
select count(*) from bigsql.SLS_PRODUCT_LINE_LOOKUP;
-- total rows in SLS_SALES_FACT = 446023
select count(*) from bigsql.SLS_SALES_FACT;
__6. Open a terminal window.
__7. Enable public access to the sample tables in your Hive warehouse.
From the command line, issue this command to switch to the root user ID temporarily:
su root
When prompted, enter the password for this account. Then switch to the hdfs ID.
su hdfs
While logged in as user hdfs, issue this command to provide public access to all Hive warehouse
tables:
hdfs dfs -chmod -R 777 /apps/hive/warehouse
2.4. Create a Big SQL externally managed table
Now that you have 2 Big SQL tables in the Hive warehouse, it's time to create an externally managed Big
SQL table – i.e., a table created over a user directory that resides outside of the Hive warehouse. This
user directory will contain the table’s data in files. Creating such a table effectively layers a SQL schema
over existing HDFS data (or data that you may later upload into the target HDFS directory).
__8. From your terminal window, check the directory permissions for HDFS.
hdfs dfs -ls /
If the /user directory is not publicly writable (check the permissions shown in the command output), you
will need to change these permissions so that you can create the necessary subdirectories for this
lab.
While logged in as user hdfs, issue this command:
hdfs dfs -chmod 777 /user
Next, confirm the effect of your change:
hdfs dfs -ls /
Exit the hdfs user account:
exit
Finally, exit the root user account and return to the standard account you’ll be using for this lab
(e.g., bigsql):
exit
__9. Create a directory structure in your distributed file system for the source data file for the product
dimension table. (If desired, alter the HDFS information as appropriate for your environment.)
hdfs dfs -mkdir /user/bigsql_spark_lab
hdfs dfs -mkdir /user/bigsql_spark_lab/sls_product_dim
__10. Upload the source data file (the Big SQL sample data file named
GOSALESDW.SLS_PRODUCT_DIM.txt) into the target DFS directory. Change the local and DFS
directory information below to match your environment.
hdfs dfs -copyFromLocal /usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt /user/bigsql_spark_lab/sls_product_dim/SLS_PRODUCT_DIM.txt
__11. List the contents of the HDFS directory and verify that your sample data file is present.
hdfs dfs -ls /user/bigsql_spark_lab/sls_product_dim
__12. Ensure public access to this lab's directory structure.
hdfs dfs -chmod -R 777 /user/bigsql_spark_lab
__13. Return to your Big SQL query execution environment (JSqsh) and connect to your Big SQL
database.
__14. Create an external Big SQL table for the sales product dimension (extern.sls_product_dim).
Note that the LOCATION clause references the DFS directory into which you copied the sample
data.
-- product dimension table
CREATE EXTERNAL HADOOP TABLE IF NOT EXISTS extern.sls_product_dim
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
location '/user/bigsql_spark_lab/sls_product_dim';
If you encounter a SQL -5105 error message, the HDFS directory permissions for your target
directory (e.g., /user/bigsql_spark_lab) may be too restrictive.
From an OS terminal window, issue this command:
hdfs dfs -ls /user/bigsql_spark_lab
Your permissions must include rw settings. Consult the earlier steps in this lab for instructions
on how to reset HDFS directory permissions.
__15. Verify that you can query the table.
-- total rows in EXTERN.SLS_PRODUCT_DIM = 274
select count(*) from EXTERN.SLS_PRODUCT_DIM;
2.5. Create a Big SQL table in HBase
Finally, create and populate a Big SQL table managed by HBase. Specifically, create a table that joins
rows from one of your Big SQL Hive tables (bigsql.sls_product_line_lookup) with rows in your
externally managed Big SQL table (extern.sls_product_dim). Your HBase table will effectively
“flatten” (or de-normalize) content in these tables into a structure that’s more efficient for processing in
HBase.
__16. Create a Big SQL table named sls_product_flat managed by HBase and populate this table
with the results of a query spanning two Big SQL tables that you created previously.
CREATE hbase TABLE IF NOT EXISTS sls_product_flat
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_line_en VARCHAR(90)
, product_line_de VARCHAR(90)
)
column mapping
(
key mapped by (product_key),
data:c2 mapped by (product_line_code),
data:c3 mapped by (product_type_key),
data:c4 mapped by (product_type_code),
data:c5 mapped by (product_line_en),
data:c6 mapped by (product_line_de)
)
as select product_key, d.product_line_code, product_type_key,
product_type_code, product_line_en, product_line_de
from extern.sls_product_dim d, bigsql.sls_product_line_lookup l
where d.product_line_code = l.product_line_code;
This statement creates a Big SQL table named SLS_PRODUCT_FLAT in the current user schema. The
COLUMN MAPPING clause specifies how SQL columns are to be mapped to HBase columns in column
families. For example, the SQL PRODUCT_KEY column maps to the HBase row key, while other SQL
columns are mapped to various columns within the HBase data column family. For more details about
Big SQL HBase support, including column mappings, see the separate lab on Working with HBase and
Big SQL (https://www.slideshare.net/CynthiaSaracco/h-base-introlabv4) or consult the product
documentation.
__17. Verify that 274 rows are present in your Big SQL HBase table.
select count(*) from bigsql.sls_product_flat;
+-----+
| 1 |
+-----+
| 274 |
+-----+
__18. Optionally, query the table to become familiar with its contents.
select product_key, product_line_code, product_line_en from bigsql.sls_product_flat
where product_key > 30270;
+-------------+-------------------+----------------------+
| PRODUCT_KEY | PRODUCT_LINE_CODE | PRODUCT_LINE_EN |
+-------------+-------------------+----------------------+
| 30271 | 993 | Personal Accessories |
| 30272 | 993 | Personal Accessories |
| 30273 | 993 | Personal Accessories |
| 30274 | 993 | Personal Accessories |
+-------------+-------------------+----------------------+
4 rows in results(first row: 0.106s; total: 0.108s)
2.6. Querying and manipulating Big SQL data through Spark
Now that you’ve created and populated the sample Big SQL tables required for this lab, it’s time to
experiment with accessing them. In this module, you’ll launch the Spark shell and issue Scala commands
and expressions to retrieve Big SQL data. Specifically, you’ll use Spark SQL to query data in Big SQL
tables. You’ll model the result sets from your queries as DataFrames, the Spark equivalent of a table. For
details on DataFrames, visit the Spark web site.
2.7. Explore the basics
In this module, you’ll learn how to use Spark’s support for JDBC data sources to access data in a single
Big SQL table.
__1. From a terminal window, launch the Spark shell using the --driver-class-path option to
specify the location of the Big SQL JDBC driver class (db2jcc4.jar) in your environment.
Adjust the directory information below for the Spark shell and Big SQL .jar file to match
your environment.
/usr/iop/current/spark2-client/bin/spark-shell --driver-class-path /usr/ibmpacks/bigsql/4.3.0.0/db2/java/db2jcc4.jar
In this section and those that follow, each command you enter is shown first, followed by the sample
output generated from that command.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Spark context Web UI available at http://xxx.yy.zzz.164:4040
Spark context available as 'sc' (master = local[*], app id = local-1490640789034).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
__2. From the Spark shell, import classes that you’ll need for subsequent lab work.
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
__3. Load data from your sls_product_line_lookup table (in Hive) into a DataFrame named
lookupDF. Adjust the JDBC specification (option values) below as needed to match your
environment.
val lookupDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.sls_product_line_lookup").option("user", "bigsql").option("password", "bigsql").load()
scala> val lookupDF = spark.read.format("jdbc").option("url",
"jdbc:db2://xxxx.com:32051/BIGSQL").option("dbtable","bigsql.sls_product_line_lookup
").option("user", "bigsql").option("password", "bigsql").load()
lookupDF: org.apache.spark.sql.DataFrame = [PRODUCT_LINE_CODE: int, PRODUCT_LINE_EN:
string ... 22 more fields]
As you’ll note, connection property values are hard-coded in this example to make it easy for you
to follow. For production use, you may prefer a more flexible approach through the use of
java.util.Properties; see the Spark API documentation for details.
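For illustration, here is a minimal sketch of that alternative, assuming the same host, port, and credentials used above; the DataFrame it returns is equivalent to lookupDF.

import java.util.Properties

// Keep the connection settings in one Properties object rather than repeating
// .option() calls; in practice you might read these values from a configuration file.
val connProps = new Properties()
connProps.setProperty("user", "bigsql")
connProps.setProperty("password", "bigsql")

val bigsqlUrl = "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL"
val lookupDF2 = spark.read.jdbc(bigsqlUrl, "bigsql.sls_product_line_lookup", connProps)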
__4. Display the contents of your DataFrame. The number of columns in your data set, coupled with
the length of many of the string columns, may make the output format of this show() operation
difficult to read on your screen.
lookupDF.show()
scala> lookupDF.show()
+-----------------+--------------------+--------------------+--------------------+----------
-----+--------------------+-------------------+--------------------+--------------------+---
-----------------+--------------------+--------------------+--------------------+-----------
----+--------------------+--------------------+--------------------+--------------------+---
-----------------+--------------------+---------------+--------------------+---------------
+--------------------+
|PRODUCT_LINE_CODE| PRODUCT_LINE_EN| PRODUCT_LINE_DE|
PRODUCT_LINE_FR|PRODUCT_LINE_JA| PRODUCT_LINE_CS| PRODUCT_LINE_DA|
PRODUCT_LINE_EL| PRODUCT_LINE_ES| PRODUCT_LINE_FI| PRODUCT_LINE_HU|
PRODUCT_LINE_ID| PRODUCT_LINE_IT|PRODUCT_LINE_KO| PRODUCT_LINE_MS|
PRODUCT_LINE_NL| PRODUCT_LINE_NO| PRODUCT_LINE_PL| PRODUCT_LINE_PT|
PRODUCT_LINE_RU|PRODUCT_LINE_SC| PRODUCT_LINE_SV|PRODUCT_LINE_TC| PRODUCT_LINE_TH|
+-----------------+--------------------+--------------------+--------------------+----------
-----+--------------------+-------------------+--------------------+--------------------+---
-----------------+--------------------+--------------------+--------------------+-----------
----+--------------------+--------------------+--------------------+--------------------+---
-----------------+--------------------+---------------+--------------------+---------------
+--------------------+
| 991| Camping Equipment| Campingausrüstung| Matériel de camping|
キャンプ用品|Vybavení pro kemp...| Campingudstyr|Εξοπλισμός κατασκ...| Equipo de
acampada| Retkeilyvarusteet| Kempingfelszerelés|Perlengkapan Berk...|Attrezzatura per
...| 캠핑 장비|Kelengkapan Berkh...|Kampeerbenodigdheden|
Campingutstyr|Ekwipunek kempingowy|Equipamento acamp...|Снаряжение для ту...|
露营装备| Campingutrustning| 露營器材| อุปกรณ์ตั้งแคมป์ |
| 992|Mountaineering Eq...|Bergsteigerausrüs...|Matériel de montagne|
登山用品|Horolezecké vybavení| Alpint udstyr|Εξοπλισμός ορειβα...|Equipo de
montañismo|Vuorikiipeilyvaru...|Hegymászó-felszer...|Perlengkapan Pend...|Attrezzatura per
...| 등산 장비|Kelengkapan Menda...| Bergsportartikelen| Klatreutstyr|
Sprzęt wspinaczkowy|Equipamento monta...| Горное снаряжение| 登山装备|
Klätterutrustning| 登山器材| อุปกรณ์ปีนเขา|
| 993|Personal Accessories| Accessoires|Accessoires perso...|
個人装備| Věci osobní potřeby|Personligt tilbehør| Προσωπικά είδη|Accesorios
person...|Henkilökohtaiset ...|Személyes kiegész...| Aksesori pribadi| Accessori
personali| 개인 용품| Aksesori Diri|Persoonlijke acce...|Personlig utrustning|
Akcesoria osobiste| Acessórios pessoais|Личные принадлежн...| 个人附件|Personliga
tillbehör| 個人配件| อุปกรณ์ส่วนตัว|
| 994| Outdoor Protection|Outdoor-Schutzaus...|Articles de prote...|
アウトドア用保護用品| Vybavení do přírody|Udendørsbeskyttelse|Προστασία για την...|Protección
aire l...| Ulkoiluvarusteet| Védőfelszerelés|Perlindungan Luar...|Protezione
personale| 야외 보호 장비|Perlindungan Luar...|Buitensport - pre...|Utendørs
beskyttelse|Wyposażenie ochronne| Proteção ar livre| Средства защиты|
户外防护用品| Skyddsartiklar| 戶外防護器材|สิ่งป้องกันเมื่ออ...|
| 995| Golf Equipment| Golfausrüstung| Matériel de golf|
ゴルフ用品| Golfové potřeby| Golfudstyr| Εξοπλισμός γκολφ| Equipo de
golf| Golf-varusteet| Golffelszerelés| Perlengkapan Golf|Attrezzatura da golf|
골프 장비| Kelengkapan Golf| Golfartikelen| Golfutstyr| Ekwipunek
golfowy| Equipamento golfe|Снаряжение для го...| 高尔夫球装备| Golfutrustning|
高爾夫球器材| อุปกรณ์กอล์ฟ|
+-----------------+--------------------+--------------------+--------------------+----------
-----+--------------------+-------------------+--------------------+--------------------+---
-----------------+--------------------+--------------------+--------------------+-----------
----+--------------------+--------------------+--------------------+--------------------+---
-----------------+--------------------+---------------+--------------------+---------------
+--------------------+
__5. If desired, experiment with other operations you can perform on DataFrames to inspect their
contents, consulting the Spark online documentation as needed. The example below collects
each row in the DataFrame into an Array and prints the contents of each array element.
lookupDF.collect().foreach(println)
scala> lookupDF.collect().foreach(println)
[991,Camping Equipment,Campingausrüstung,Matériel de camping,キャンプ用品,Vybavení
pro kempování,Campingudstyr,Εξοπλισμός κατασκήνωσης,Equipo de
acampada,Retkeilyvarusteet,Kempingfelszerelés,Perlengkapan Berkemah,Attrezzatura per
campeggio,캠핑 장비,Kelengkapan
Berkhemah,Kampeerbenodigdheden,Campingutstyr,Ekwipunek kempingowy,Equipamento
acampamento,Снаряжение для туризма,露营装备,Campingutrustning,露營器材,อุปกรณ์ตั้งแคมป์ ]
[992,Mountaineering Equipment,Bergsteigerausrüstung,Matériel de
montagne,登山用品,Horolezecké vybavení,Alpint udstyr,Εξοπλισμός ορειβασίας,Equipo de
montañismo,Vuorikiipeilyvarusteet,Hegymászó-felszerelés,Perlengkapan Pendaki
Gunung,Attrezzatura per alpinismo,등산 장비,Kelengkapan Mendaki
Gunung,Bergsportartikelen,Klatreutstyr,Sprzęt wspinaczkowy,Equipamento
montanhismo,Горное снаряжение,登山装备,Klätterutrustning,登山器材,อุปกรณ์ปีนเขา]
[993,Personal Accessories,Accessoires,Accessoires personnels,個人装備,Věci osobní
potřeby,Personligt tilbehør,Προσωπικά είδη,Accesorios personales,Henkilökohtaiset
tarvikkeet,Személyes kiegészítők,Aksesori pribadi,Accessori personali,개인
용품,Aksesori Diri,Persoonlijke accessoires,Personlig utrustning,Akcesoria
osobiste,Acessórios pessoais,Личные принадлежности,个人附件,Personliga
tillbehör,個人配件,อุปกรณ์ส่วนตัว]
[994,Outdoor Protection,Outdoor-Schutzausrüstung,Articles de
protection,アウトドア用保護用品,Vybavení do přírody,Udendørsbeskyttelse,Προστασία για
την ύπαιθρο,Protección aire libre,Ulkoiluvarusteet,Védőfelszerelés,Perlindungan Luar
Ruang,Protezione personale,야외 보호 장비,Perlindungan Luar Bangunan,Buitensport -
preventie,Utendørs beskyttelse,Wyposażenie ochronne,Proteção ar livre,Средства
защиты,户外防护用品,Skyddsartiklar,戶外防護器材,สิ่งป้องกันเมื่ออยู่กลางแจ ้ง]
[995,Golf Equipment,Golfausrüstung,Matériel de golf,ゴルフ用品,Golfové
potřeby,Golfudstyr,Εξοπλισμός γκολφ,Equipo de golf,Golf-
varusteet,Golffelszerelés,Perlengkapan Golf,Attrezzatura da golf,골프
장비,Kelengkapan Golf,Golfartikelen,Golfutstyr,Ekwipunek golfowy,Equipamento
golfe,Снаряжение для гольфа,高尔夫球装备,Golfutrustning,高爾夫球器材,อุปกรณ์กอล์ฟ]
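A few other standard DataFrame operations you might try here, shown as a quick sketch (see the Spark documentation for the full API):

lookupDF.printSchema()    // column names and types inferred from the Big SQL table
lookupDF.count()          // number of rows (should be 5)
lookupDF.select("product_line_code", "product_line_en").show()   // a narrower, easier-to-read projection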
__6. To query the DataFrame, create a temporary view of it.
lookupDF.createOrReplaceTempView("lookup")
__7. Use Spark SQL to query your temporary view and display the results. The following example
returns English and French product line information for all product line codes below 995.
spark.sql("select product_line_code, product_line_en, product_line_fr from lookup
where product_line_code < 995").show()
scala> spark.sql("select product_line_code, product_line_en, product_line_fr from
lookup where product_line_code < 995").show()
+-----------------+--------------------+--------------------+
|product_line_code| product_line_en| product_line_fr|
+-----------------+--------------------+--------------------+
| 991| Camping Equipment| Matériel de camping|
| 992|Mountaineering Eq...|Matériel de montagne|
| 993|Personal Accessories|Accessoires perso...|
| 994| Outdoor Protection|Articles de prote...|
+-----------------+--------------------+--------------------+
Note that string values in the last two columns of this result were truncated due to defaults
associated with show().
__8. If desired, modify the previous command slightly to display more content in the final two columns
of your result. For example, use show(5,100) to specify that a maximum of 5 rows should be
returned and that string column values should be truncated after 100 characters.
spark.sql("select product_line_code, product_line_en, product_line_fr from lookup
where product_line_code < 995").show(5,100)
scala> spark.sql("select product_line_code, product_line_en, product_line_fr from
lookup where product_line_code < 995").show(5,100)
+-----------------+------------------------+----------------------+
|product_line_code| product_line_en| product_line_fr|
+-----------------+------------------------+----------------------+
| 991| Camping Equipment| Matériel de camping|
| 992|Mountaineering Equipment| Matériel de montagne|
| 993| Personal Accessories|Accessoires personnels|
| 994| Outdoor Protection|Articles de protection|
+-----------------+------------------------+----------------------+
2.8. Join data from multiple tables
In this module, you’ll learn how to use Spark to query data from Big SQL tables managed outside of the
Hive warehouse. In doing so, you’ll see that the underlying storage mechanism that Big SQL uses for its
tables is hidden from the Spark programmer. In other words, you don’t need to know how Big SQL is
storing the data – you simply query Big SQL tables as you would any other JDBC data source.
__9. Following the same approach that you executed in the previous module, create a DataFrame for
your Big SQL extern.sls_product_dim table. Adjust the specifications (option values)
below as needed to match your environment.
val dimDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","extern.sls_product_dim").option("user", "bigsql").option("password", "bigsql").load()
__10. Create a temporary view so that you can query the data.
dimDF.createOrReplaceTempView("dim")
__11. Query the data set, making sure that there are 274 rows.
spark.sql("select count(*) from dim").show()
scala> spark.sql("select count(*) from dim").show()
+--------+
|count(1)|
+--------+
| 274|
+--------+
__12. Join data in your lookup and dim views and display the results. Recall that lookup loaded data
from a Big SQL Hive table while dim loaded data from a Big SQL externally managed table.
spark.sql("select l.product_line_code, l.product_line_en, d.product_number,
d.introduction_date from lookup l, dim d where
l.product_line_code=d.product_line_code limit 15").show()
scala> spark.sql("select l.product_line_code, l.product_line_en, d.product_number,
d.introduction_date from lookup l, dim d where
l.product_line_code=d.product_line_code limit 15").show()
+-----------------+---------------+--------------+--------------------+
|product_line_code|product_line_en|product_number| introduction_date|
+-----------------+---------------+--------------+--------------------+
| 995| Golf Equipment| 101110|2003-12-15 00:00:...|
| 995| Golf Equipment| 102110|2003-12-10 00:00:...|
| 995| Golf Equipment| 103110|2003-12-10 00:00:...|
| 995| Golf Equipment| 104110|2003-12-18 00:00:...|
| 995| Golf Equipment| 105110|2003-12-27 00:00:...|
| 995| Golf Equipment| 106110|2003-12-05 00:00:...|
| 995| Golf Equipment| 107110|2004-01-13 00:00:...|
| 995| Golf Equipment| 108110|2003-12-27 00:00:...|
| 995| Golf Equipment| 109110|2003-12-10 00:00:...|
| 995| Golf Equipment| 110110|2003-12-10 00:00:...|
| 995| Golf Equipment| 111110|2003-12-15 00:00:...|
| 995| Golf Equipment| 112110|2004-01-10 00:00:...|
| 995| Golf Equipment| 113110|2004-01-15 00:00:...|
| 995| Golf Equipment| 114110|2003-12-15 00:00:...|
| 995| Golf Equipment| 115110|2003-12-27 00:00:...|
+-----------------+---------------+--------------+--------------------+
__13. Next, explore how to query a Big SQL HBase table. Create a DataFrame for your
sls_product_flat table. Adjust the specification (option values) below as needed to
match your environment.
val hbaseDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.sls_product_flat").option("user", "bigsql").option("password", "bigsql").load()
__14. Create a temporary view of your DataFrame.
hbaseDF.createOrReplaceTempView("flat")
__15. Query this view.
spark.sql("select * from flat where product_key > 30270").show()
scala> spark.sql("select * from flat where product_key > 30270").show()
+-----------+-----------------+----------------+-----------------+--------------------+---------------+
|PRODUCT_KEY|PRODUCT_LINE_CODE|PRODUCT_TYPE_KEY|PRODUCT_TYPE_CODE| PRODUCT_LINE_EN|PRODUCT_LINE_DE|
+-----------+-----------------+----------------+-----------------+--------------------+---------------+
| 30271| 993| 960| 960|Personal Accessories| Accessoires|
| 30272| 993| 960| 960|Personal Accessories| Accessoires|
| 30273| 993| 960| 960|Personal Accessories| Accessoires|
| 30274| 993| 960| 960|Personal Accessories| Accessoires|
+-----------+-----------------+----------------+-----------------+--------------------+---------------+
__16. To prepare to join data from your Big SQL HBase table with data from a Big SQL Hive table,
create a DataFrame for the sales fact table. Adjust the specification (option values) below
as needed to match your environment.
val factDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.sls_sales_fact").option("user", "bigsql").option("password", "bigsql").load()
__17. Create a view of this DataFrame.
factDF.createOrReplaceTempView("fact")
__18. Join data in your temporary fact and flat views. These views are based on the Big SQL
sales fact table (sls_sales_fact) managed by Hive and the Big SQL dimension table
(sls_product_flat) managed by HBase.
spark.sql("select f.retailer_key, f.product_key, h.product_line_en, f.sale_total
from fact f, flat h where f.product_key = h.product_key and sale_total > 50000 limit
10").show()
scala> spark.sql("select f.retailer_key, f.product_key, h.product_line_en,
f.sale_total from fact f, flat h where f.product_key = h.product_key and sale_total
> 50000 limit 10").show()
+------------+-----------+---------------+----------+
|retailer_key|product_key|product_line_en|sale_total|
+------------+-----------+---------------+----------+
| 6870| 30128| Golf Equipment| 60812.38|
| 6875| 30128| Golf Equipment| 79123.29|
| 6872| 30128| Golf Equipment| 61316.35|
| 6875| 30128| Golf Equipment| 79291.28|
| 6875| 30128| Golf Equipment| 77275.4|
| 6875| 30128| Golf Equipment| 65348.11|
| 6871| 30128| Golf Equipment| 51740.92|
| 6875| 30128| Golf Equipment| 71059.77|
| 6875| 30128| Golf Equipment| 79627.26|
| 7154| 30128| Golf Equipment| 58460.52|
+------------+-----------+---------------+----------+
2.9. Use Spark MLlib to work with Big SQL data
Now that you understand how to use Spark SQL to work with Big SQL tables, let's explore how to use other
Spark technologies to manipulate Big SQL data. In this exercise, you’ll use a simple function in Spark’s
machine learning library (MLlib) to transform a DataFrame built from data in one or more Big SQL tables.
Quite often, transformations are needed before using more sophisticated analytical functions available
through MLlib and other libraries.
__19. Import Spark MLlib classes that you’ll be using shortly.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
__20. Create a DataFrame holding the results of a new query against the Spark SQL temporary view you
created earlier for the Big SQL sales fact table. Your new DataFrame will serve as input to a
subsequent transformation operation.
val test=sql("select quantity, unit_cost, sale_total, gross_profit from fact")
__21. Create a VectorAssembler that combines data in a list of columns into a single vector column. In
this example, two columns (quantity and unit_cost) will be transformed into a new vector
column named data.
val assembler = new VectorAssembler().setInputCols(Array("quantity","unit_cost")).setOutputCol("data")
__22. Execute the transformation operation and display the results. Note that this form of show()
displays only the top 20 records of the results.
val output = assembler.transform(test).show()
scala> val output = assembler.transform(test).show()
+--------+------------------+----------+------------+--------------------+
|quantity| unit_cost|sale_total|gross_profit| data|
+--------+------------------+----------+------------+--------------------+
| 587| 34.9| 41958.76| 21472.46| [587.0,34.9]|
| 214| 91.8| 35949.86| 16304.66| [214.0,91.8]|
| 576| 2.67| 5921.28| 4383.36| [576.0,2.67]|
| 129| 56.23| 11570.01| 4316.34| [129.0,56.23]|
| 1776|1.8599999999999999| 10123.2| 6819.84|[1776.0,1.8599999...|
| 1822| 1.79| 8654.5| 5393.12| [1822.0,1.79]|
| 412| 9.1| 9191.72| 5442.52| [412.0,9.1]|
| 67| 690.0| 79200.7| 32970.7| [67.0,690.0]|
| 97| 238.88| 41543.16| 18371.8| [97.0,238.88]|
| 1172| 6.62| 10278.44| 2519.8| [1172.0,6.62]|
| 591| 34.97| 30838.38| 10171.11| [591.0,34.97]|
| 338| 85.11| 40776.32| 12009.14| [338.0,85.11]|
| 97| 426.0| 61075.08| 19753.08| [97.0,426.0]|
| 364| 86.0| 49704.2| 18400.2| [364.0,86.0]|
| 234| 65.25| 22737.78| 7469.28| [234.0,65.25]|
| 603| 16.0| 19446.75| 9798.75| [603.0,16.0]|
| 232| 18.0| 6625.92| 2449.92| [232.0,18.0]|
| 450| 18.05| 15075.0| 6952.5| [450.0,18.05]|
| 257| 20.0| 7864.2| 2724.2| [257.0,20.0]|
| 191| 15.62| 6434.79| 3451.37| [191.0,15.62]|
+--------+------------------+----------+------------+--------------------+
only showing top 20 rows
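If you want to take this a step further, the assembled vector column can feed directly into an MLlib algorithm such as k-means clustering. Here is a minimal, optional sketch that reuses the test DataFrame and assembler defined above:

import org.apache.spark.ml.clustering.KMeans

// Group the sales fact rows into 3 clusters based on quantity and unit cost.
val assembled = assembler.transform(test)
val kmeans = new KMeans().setK(3).setFeaturesCol("data").setSeed(1L)
val model = kmeans.fit(assembled)
model.clusterCenters.foreach(println)   // print the center of each cluster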
2.10. Optional: Join Big SQL and JSON data
In this module, you’ll explore how to use Spark to query JSON data and join that data with data from a
Big SQL table. Spark SQL enables programmers to query data in a variety of popular formats without
manually defining tables or mapping schemas to tables. Since JSON is a very popular data format, you
may find it convenient to use Spark SQL’s built-in JSON support to query data in JSON files. And since
Big SQL is a popular target for managing “cold” corporate data extracted from a relational data
warehouse, you may occasionally need to join JSON and Big SQL data.
To begin, collect some JSON data. Examples in this lab are based on 10-day weather forecasts
generated by The Weather Company’s limited-use free service on Bluemix. For details about this
service, log into Bluemix at http://bluemix.net and visit
https://console.ng.bluemix.net/catalog/services/weather-company-data?taxonomyNavigation=apps. If
using the Bluemix weather service to generate JSON data for your work, follow the instructions below. If
you wish to use different JSON data, skip the remainder of this section.
__1. Log into Bluemix (or register for a free account, if needed).
__2. Search the catalog for “weather” services.
__3. Once you locate The Weather Company’s service, follow the standard Bluemix procedure to
create an instance of this service. Consult Bluemix online documentation, if needed, to perform
this task.
__4. After creating your weather service, access its APIs. Consult Bluemix online documentation, if
needed, to perform this task.
__5. From the listing of weather service APIs, select the Daily Forecast API.
__6. Click the service for the 10-day daily forecast by geocode.
__7. Scroll through the displayed pages to become familiar with the details of this forecast service.
Note that you can customize input parameters to control the location of the forecast, the units of
measure for the data (e.g., metric) and the language of text data (e.g., English). Accept all
default values and proceed to the bottom of the page. Locate and click the Try it out! button.
__8. When prompted for a username and password, enter the information supplied for your service’s
credentials.
The appropriate user name and password are included in the service credentials section of your
service’s main page. Do not enter your Bluemix ID and password; these will be rejected.
You must enter the username and password that were generated for your service when it was
created. If necessary, return to the main menu for your service and click on the Services
Credentials link to expose this information. Consult Bluemix documentation for details.
__9. Inspect the results returned in the Response Body section of the page.
__10. Review the structure of the JSON data returned by this service, noting that it contains multiple
levels of nesting. Top-level objects represent metadata (such as the language used in
subsequent forecasts, the longitude and latitude of where the forecast apply, etc.) and weather
forecasts. Forecasts contain an array of JSON objects that detail the minimum and maximum
temperatures, the local time for which the forecast is valid, etc. Also included in each forecast
array element are separate night and day forecasts, which contain further details. Keep the
structure of this JSON data in mind, as it dictates the syntax of your queries in subsequent
exercises.
__11. Note that the weather data returned in the Response Body section of this web page splits
information across multiple lines. You need to store the data without carriage returns or line
feeds. To do so, copy the URL displayed in the Request URL section of the web page and paste
it into a new tab on your browser. (Writing an application that calls the REST APIs is more
appropriate for production. However, I wanted to give you a quick way to collect some data.)
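If you would rather script the download than use the browser, here is a minimal sketch of one approach in Scala. The request URL, username, and password below are placeholders; substitute the Request URL shown on the API page and the credentials from your service.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.util.Base64

// Placeholder values: replace with your service credentials and your Request URL.
val user = "yourServiceUsername"
val pass = "yourServicePassword"
val requestUrl = "https://yourWeatherServiceHost/api/weather/v1/geocode/37.17/-121.75/forecast/daily/10day.json"

// Issue the request with HTTP basic authentication.
val conn = new URL(requestUrl).openConnection().asInstanceOf[HttpURLConnection]
val auth = Base64.getEncoder.encodeToString((user + ":" + pass).getBytes(StandardCharsets.UTF_8))
conn.setRequestProperty("Authorization", "Basic " + auth)

// The response arrives as a single line of JSON, which is the format you want to store.
val json = scala.io.Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
Files.write(Paths.get("weather10day.json"), json.getBytes(StandardCharsets.UTF_8))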
__12. Inspect the results, noting the change in displayed format.
__13. Copy the contents of this data and paste it into a file on your local file system.
__14. Optionally, repeat the process for another day or alter the geocode input parameter (longitude,
latitude) to collect data about a different location. Store the results in a different file so that you
will have at least 2 different 10-day forecasts to store in BigInsights. The weather API service
includes a Parameters section that you can use to alter the geocode.
__15. Use FTP or SFTP to transfer your weather data file(s) to a local file system for your BigInsights
cluster.
__16. Open a terminal window for your BigInsights cluster.
Issue an HDFS shell command to create a subdirectory within HDFS for test purposes. I created
a /user/saracco/weather subdirectory, as shown below. Modify the commands as needed
for your environment.
hdfs dfs -mkdir /user/saracco
hdfs dfs -mkdir /user/saracco/weather
__17. Copy the file(s) from your local directory to your new HDFS subdirectory. Adjust this command
as needed for your environment:
hdfs dfs -copyFromLocal weather*.* /user/saracco/weather
__18. Change permissions on your subdirectory and its contents. For example:
hdfs dfs -chmod -R 777 /user/saracco/weather
__19. List the contents of your HDFS subdirectory to validate your work, verifying that the files you
uploaded are present and that permissions provide for global access.
hdfs dfs -ls /user/saracco/weather
Found 2 items
-rwxrwxrwx   3 saracco bihdfs      30677 2017-04-10 20:56 /user/saracco/weather/weather10dayApr4.json
-rwxrwxrwx   3 saracco bihdfs      29655 2017-04-10 20:56 /user/saracco/weather/weather10dayFeb17.json
After you’ve collected some JSON data and uploaded it to HDFS, create and populate a suitable Big
SQL table to be joined with the JSON data. In this section, you’ll create a geolookup table that contains
longitude, latitude, and location information. The longitude and latitude columns will serve as join keys in
a subsequent exercise. If you plan to use different JSON data for your work, modify the table definition
and INSERT statements as needed to create a suitable table.
__20. Launch JSqsh and connect to your Big SQL database.
__21. Create a Big SQL table.
create hadoop table geolookup(longitude decimal(10,7), latitude decimal(10,7),
location varchar(30));
__22. Insert rows into your Big SQL table. For join key columns, ensure that at least some rows
contain data that will match your JSON data. For example, if you’re using weather forecasts
generated by the Bluemix service, inspect the contents of your JSON weather data to determine
appropriate data values for longitude and latitude, taking care to include the proper precision.
This example inserts 3 rows into the geolookup table; 2 of these rows contain longitude / latitude
data that matches JSON weather forecast data I collected earlier.
insert into geolookup values (84.5, 37.17, 'Xinjiang, China');
insert into geolookup values (-121.75, 37.17, 'San Jose, CA USA');
insert into geolookup values (-73.990246,40.730171,'IBM Astor Place, NYC USA');
__23. Exit JSqsh.
Now that you have sample JSON data in HDFS and a suitable Big SQL table, it’s time to query your
data. Let’s start with the JSON data. If you’re not using 10-day weather forecasts for your sample JSON
data, you’ll need to modify some of the instructions below to match your data.
__24. If you don’t already have an open window with the Spark shell launched, launch it now.
Remember to include the appropriate Big SQL JDBC driver information at launch. Adjust the
specification below as needed to match your environment.
spark-shell --driver-class-path /usr/ibmpacks/bigsql/4.3.0.0/db2/java/db2jcc4.jar
__25. From the Spark shell, define a path to point to your JSON data set. Adjust the specification
below as needed to match your environment.
val path = "/user/saracco/weather"
scala> val path = "/user/saracco/weather"
path: String = /user/saracco/weather
__26. Define a new DataFrame that will read JSON data from this path.
val weatherDF = spark.read.json(path)
scala> val weatherDF = spark.read.json(path)
17/04/03 11:40:59 WARN Utils: Truncated the string representation of a plan since it
was too large. This behavior can be adjusted by setting
'spark.debug.maxToStringFields' in SparkEnv.conf.
weatherDF: org.apache.spark.sql.DataFrame = [forecasts:
array<struct<blurb:string,blurb_author:string,class:string,day:struct<accumulation_p
hrase:string,alt_daypart_name:string,clds:bigint,day_ind:string,daypart_name:string,
fcst_valid:bigint,fcst_valid_local:string,golf_category:string,golf_index:bigint,hi:
bigint,icon_code:bigint,icon_extd:bigint,log_daypart_name:string,narrative:string,nu
m:bigint,phrase_12char:string,phrase_22char:string,phrase_32char:string,pop:bigint,p
op_phrase:string,precip_type:string,qpf:double,qualifier:string,qualifier_code:strin
g,... 24 more
fields>,dow:string,expire_time_gmt:bigint,fcst_valid:bigint,fcst_valid_local:string,
lunar_phase:string,lunar_phase_code:string,lunar_phase_day:bigint,max_temp:bigint,mi
n_temp:bigint,moonrise:string,moonset:string,narrative...
__27. Optionally, print the schema of your JSON data set so you can visualize its structure. (The
output below shows only a portion of the schema.)
weatherDF.printSchema()
scala> weatherDF.printSchema()
root
|-- forecasts: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- blurb: string (nullable = true)
| | |-- blurb_author: string (nullable = true)
| | |-- class: string (nullable = true)
| | |-- day: struct (nullable = true)
| | | |-- accumulation_phrase: string (nullable = true)
| | | |-- alt_daypart_name: string (nullable = true)
| | | |-- clds: long (nullable = true)
| | | |-- day_ind: string (nullable = true)
| | | |-- daypart_name: string (nullable = true)
. . . .
__28. Create a temporary view of this data.
weatherDF.createOrReplaceTempView("weather")
__29. Create a new DataFrame to hold the results of a query over this view. The example below
extracts the transaction ID from the metadata associated with each JSON record. (My test
scenario contains 2 JSON records.)
val idDF = spark.sql("select metadata.transaction_id from weather")
Apart from the JSON path information that specifies the data of interest within each record,
there’s nothing different about this SQL query from any other Spark SQL query.
__30. Display the results.
idDF.show(2,100)
scala> idDF.show(2,100)
+------------------------+
| transaction_id|
+------------------------+
|1459805077803:1340092112|
| 1455739258736:810296662|
+------------------------+
__31. Execute a more selective query that accesses deeply nested JSON objects. This example
retrieves the longitude and latitude, the valid date/time for the night’s forecast for the first
element in the forecast array (forecasts[0].night.fcst_valid_local), and the short description of that
night’s forecast (forecasts[0].night.shortcast).
sql("select metadata.longitude, metadata.latitude,
forecasts[0].night.fcst_valid_local, forecasts[0].night.shortcast from
weather").show()
scala> sql("select metadata.longitude, metadata.latitude,
forecasts[0].night.fcst_valid_local, forecasts[0].night.shortcast from
weather").show()
+---------+--------+-----------------------------------+----------------------------+
|longitude|latitude|forecasts[0].night.fcst_valid_local|forecasts[0].night.shortcast|
+---------+--------+-----------------------------------+----------------------------+
| -121.75| 37.17| 2016-04-04T19:00:...| Mainly clear|
| 84.5| 37.17| 2016-02-17T19:00:...| Partly cloudy|
+---------+--------+-----------------------------------+----------------------------+
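The query above reads only forecasts[0], the first element of the forecasts array. If you would like one row per forecast day instead, Spark SQL's explode() function flattens the array; here is a minimal sketch that reuses the weatherDF DataFrame defined earlier.

import org.apache.spark.sql.functions.explode

// Produce one row per element of the forecasts array (10 rows per forecast record).
val allDaysDF = weatherDF.select($"metadata.longitude", $"metadata.latitude", explode($"forecasts").as("fc"))
allDaysDF.select($"longitude", $"latitude", $"fc.fcst_valid_local", $"fc.night.shortcast").show(5, 60)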
Next, query your Big SQL data using Spark. This approach should be very familiar to you by now.
__32. Create a DataFrame for the Big SQL table (geolookup) that you intend to join with the JSON
data.
val geoDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.geolookup").option("user", "bigsql").option("password", "bigsql").load()
__33. Create a temporary view.
geoDF.createOrReplaceTempView("geolookup")
__34. Verify that you can query this view. Given how I populated the geolookup table earlier, the
correct result from this query should be 3.
spark.sql("select count(*) from geolookup").show()
__35. Optionally, display the contents of geolookup.
spark.sql("select * from geolookup").show(5,100)
scala> spark.sql("select * from geolookup").show(5,100)
+------------+----------+------------------------+
| LONGITUDE| LATITUDE| LOCATION|
+------------+----------+------------------------+
|-121.7500000|37.1700000| San Jose, CA USA|
| -73.9902460|40.7301710|IBM Astor Place, NYC USA|
| 84.5000000|37.1700000| Xinjiang, China|
+------------+----------+------------------------+
__36. Join the JSON weather forecast data with your Big SQL data. The first 3 columns of this query’s
result set are derived from the JSON data, while the final column’s data is derived from the Big
SQL table. As you’ll note, longitude and latitude data in both source data sets serve as the join
keys.
spark.sql("select w.metadata.longitude, w.metadata.latitude,
w.forecasts[0].narrative, g.location from weather w, geolookup g where
w.metadata.longitude=g.longitude and metadata.latitude=g.latitude").show()
scala> spark.sql("select w.metadata.longitude, w.metadata.latitude,
w.forecasts[0].narrative, g.location from weather w, geolookup g where
w.metadata.longitude=g.longitude and metadata.latitude=g.latitude").show()
+---------+--------+----------------------+----------------+
|longitude|latitude|forecasts[0].narrative| location|
+---------+--------+----------------------+----------------+
| 84.5| 37.17| Partly cloudy. Lo...| Xinjiang, China|
| -121.75| 37.17| Abundant sunshine...|San Jose, CA USA|
+---------+--------+----------------------+----------------+
Lab 3 Summary
In this lab, you explored one way of using Spark to work with data in Big SQL tables stored in the Hive
warehouse, in HBase and in an arbitrary HDFS directory. Through Spark SQL’s support for JDBC data
sources, you queried these tables and even invoked a simple Spark MLlib transformative operation
against data in one of these tables. Finally, if you completed the optional exercise, you saw how easy it
can be to join data in JSON files with data in Big SQL tables using Spark SQL.
To expand your skills and learn more, enroll in free online courses offered by Big Data University
(http://www.bigdatauniversity.com/) or work through free tutorials included in the BigInsights product
documentation. The HadoopDev web site (https://developer.ibm.com/hadoop/) contains links to these
and other resources.
© Copyright IBM Corporation 2017. Written by C. M. Saracco.
The information contained in these materials is provided for
informational purposes only, and is provided AS IS without warranty
of any kind, express or implied. IBM shall not be responsible for any
damages arising out of the use of, or otherwise related to, these
materials. Nothing contained in these materials is intended to, nor
shall have the effect of, creating any warranties or representations
from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of
IBM software. References in these materials to IBM products,
programs, or services do not imply that they will be available in all
countries in which IBM operates. This information is based on
current IBM product plans and strategy, which are subject to change
by IBM without notice. Product release dates and/or capabilities
referenced in these materials may change at any time at IBM’s sole
discretion based on market opportunities or other factors, and are not
intended to be a commitment to future product or feature availability
in any way.
IBM, the IBM logo and ibm.com are trademarks of International
Business Machines Corp., registered in many jurisdictions
worldwide. Other product and service names might be trademarks of
IBM or other companies. A current list of IBM trademarks is
available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml.
Taming Big Data with Big SQL 3.0
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on Hadoop
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
 
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019
MySQL Baics - Texas Linxufest beginners tutorial May 31st, 2019
 
SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...
SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...
SQL Server Extended Events presentation from SQL Midlands User Group 14th Mar...
 
SQL Server Extended Events
SQL Server Extended Events SQL Server Extended Events
SQL Server Extended Events
 
MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019
MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019
MySQL 8.0 Features -- Oracle CodeOne 2019, All Things Open 2019
 
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the CloudBest Practices for Deploying Hadoop (BigInsights) in the Cloud
Best Practices for Deploying Hadoop (BigInsights) in the Cloud
 
Sqrrl and Accumulo
Sqrrl and AccumuloSqrrl and Accumulo
Sqrrl and Accumulo
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 

Similaire à Big Data: Working with Big SQL data from Spark

Talend openstudio bigdata_gettingstarted_6.3.0_en
Talend openstudio bigdata_gettingstarted_6.3.0_enTalend openstudio bigdata_gettingstarted_6.3.0_en
Talend openstudio bigdata_gettingstarted_6.3.0_enManoj Sharma
 
Sql server 2012 tutorials writing transact-sql statements
Sql server 2012 tutorials   writing transact-sql statementsSql server 2012 tutorials   writing transact-sql statements
Sql server 2012 tutorials writing transact-sql statementsSteve Xu
 
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"Lviv Startup Club
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
 
Spark View Engine (Richmond)
Spark View Engine (Richmond)Spark View Engine (Richmond)
Spark View Engine (Richmond)curtismitchell
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Edureka!
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache SparkQuantUniversity
 
HDinsight Workshop - Prerequisite Activity
HDinsight Workshop - Prerequisite ActivityHDinsight Workshop - Prerequisite Activity
HDinsight Workshop - Prerequisite ActivityIdan Tohami
 
Hortonworks Setup & Configuration on Azure
Hortonworks Setup & Configuration on AzureHortonworks Setup & Configuration on Azure
Hortonworks Setup & Configuration on AzureAnita Luthra
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorBlueData, Inc.
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache SparkEdureka!
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With SparkEdureka!
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 

Similaire à Big Data: Working with Big SQL data from Spark (20)

hbase lab
hbase labhbase lab
hbase lab
 
Spark1
Spark1Spark1
Spark1
 
Talend openstudio bigdata_gettingstarted_6.3.0_en
Talend openstudio bigdata_gettingstarted_6.3.0_enTalend openstudio bigdata_gettingstarted_6.3.0_en
Talend openstudio bigdata_gettingstarted_6.3.0_en
 
Sql server 2012 tutorials writing transact-sql statements
Sql server 2012 tutorials   writing transact-sql statementsSql server 2012 tutorials   writing transact-sql statements
Sql server 2012 tutorials writing transact-sql statements
 
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"
Andriy Zrobok "MS SQL 2019 - new for Big Data Processing"
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Spark View Engine (Richmond)
Spark View Engine (Richmond)Spark View Engine (Richmond)
Spark View Engine (Richmond)
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache Spark
 
HDinsight Workshop - Prerequisite Activity
HDinsight Workshop - Prerequisite ActivityHDinsight Workshop - Prerequisite Activity
HDinsight Workshop - Prerequisite Activity
 
Apache spark
Apache sparkApache spark
Apache spark
 
Hortonworks Setup & Configuration on Azure
Hortonworks Setup & Configuration on AzureHortonworks Setup & Configuration on Azure
Hortonworks Setup & Configuration on Azure
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab Accelerator
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 

Plus de Cynthia Saracco

Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Cynthia Saracco
 
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS dataCynthia Saracco
 
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data:  InterConnect 2016 Session on Getting Started with Big Data AnalyticsBig Data:  InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data: InterConnect 2016 Session on Getting Started with Big Data AnalyticsCynthia Saracco
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM Cynthia Saracco
 
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data:  Technical Introduction to BigSheets for InfoSphere BigInsightsBig Data:  Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsightsCynthia Saracco
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Cynthia Saracco
 

Plus de Cynthia Saracco (6)

Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
 
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS data
 
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data:  InterConnect 2016 Session on Getting Started with Big Data AnalyticsBig Data:  InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM
 
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data:  Technical Introduction to BigSheets for InfoSphere BigInsightsBig Data:  Technical Introduction to BigSheets for InfoSphere BigInsights
Big Data: Technical Introduction to BigSheets for InfoSphere BigInsights
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
 

Dernier

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Dernier (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Big Data: Working with Big SQL data from Spark

  • 1. Using Spark with Big SQL Cynthia M. Saracco IBM Solution Architect April 12, 2017
  • 2. Contents LAB 1 OVERVIEW......................................................................................................................................................... 4 1.1. WHAT YOU'LL LEARN ................................................................................................................................ 4 1.2. PRE-REQUISITES ..................................................................................................................................... 4 1.3. ABOUT YOUR ENVIRONMENT ..................................................................................................................... 5 1.4. GETTING STARTED ................................................................................................................................... 6 LAB 2 USING SPARK TO WORK WITH BIG SQL TABLES........................................................................................ 7 2.1. CREATING AND POPULATING BIG SQL SAMPLE TABLES ................................................................................ 7 2.2. LAUNCH JSQSH AND CONNECT TO YOUR DATABASE ...................................................................... 7 2.3. CREATE BIG SQL HIVE TABLES .................................................................................................. 8 2.4. CREATE A BIG SQL EXTERNALLY MANAGED TABLE........................................................................ 9 2.5. CREATE A BIG SQL TABLE IN HBASE ........................................................................................ 12 2.6. QUERYING AND MANIPULATING BIG SQL DATA THROUGH SPARK ................................................................ 13 2.7. EXPLORE THE BASICS .............................................................................................................. 14 2.8. JOIN DATA FROM MULTIPLE TABLES............................................................................................ 18 2.9. USE SPARK MLLIB TO WORK WITH BIG SQL DATA ...................................................................... 21 2.10. OPTIONAL: JOIN BIG SQL AND JSON DATA ............................................................................... 22 LAB 3 SUMMARY ....................................................................................................................................................... 32
  • 3.
  • 4. 4 Lab 1 Overview This hands-on lab helps you explore how Spark programmers can work with data managed by Big SQL. Big SQL is high-performance query engine for IBM BigInsights and the Hortonworks Data Platform. Apache Spark, part of IBM’s Open Platform for Apache Hadoop, is a fast, general-purpose engine for processing Big Data, including data managed by Hadoop. Particularly appealing to many Spark programmers are built-in and third-party libraries for machine learning, streaming, SQL, and more. Given the popularity of both Big SQL and Spark, organizations may want to deploy and use both technologies. This lab introduces you to one way in which you can integrate these technologies – namely, by using Spark SQL and its support for JDBC-enabled data sources to manipulate Big SQL tables in Hive, HBase, or arbitrary HDFS directories. It’s worth noting that other forms of integration are possible. For example, Big SQL programmers can launch Spark jobs from within their Big SQL queries and integrate results returned from Spark with Big SQL data. However, such integration is beyond the scope of this lab. Consult the Big SQL production documentation for further details, if interested. 1.1. What you'll learn After completing all exercises in this lab, you'll know how to • Work with data in Big SQL tables using the Spark shell. • Create and populate Spark DataFrames with data from Big SQL tables. • Query and join data from Big SQL tables using Spark SQL. • Use a simple Spark ML (machine learning) function to operate on Big SQL data. • Optionally, use Spark SQL to join Big SQL data with complex JSON data in HDFS. Allow 1 – 2 hours to complete this lab. Special thanks to Dan Kikuchi for reviewing this material. 1.2. Pre-requisites Prior to beginning this lab, you will need access to a BigInsights 4.3 environment as described in the subsequent section. In addition, you should be familiar with the basics of Big SQL and Spark. Labs 1 – 4 of Getting Started with Big SQL (http://www.slideshare.net/CynthiaSaracco/big-sql40-hol) and the lab on lab on Working with HBase and Big SQL (https://www.slideshare.net/CynthiaSaracco/h-base- introlabv4) will help you become familiar with the fundamentals of Big SQL needed for this lab. The Apache Spark web site (https://spark.apache.org/) contains various resources to help you become familiar with Spark.
  • 5. 5 1.3. About your environment This lab requires a BigInsights 4.3 environment in which Big SQL, JSqsh (a SQL command-line interface), HBase, and Spark are installed and running. Big SQL and JSqsh are part of IBM BigInsights. Spark is part of the IBM Open Platform for Apache Hadoop upon which BigInsights is based. Examples in this lab were tested on a 4-node test cluster running BigInsights 4.3 technical preview 2 with Spark 2.1; the specific configuration of this cluster is outlined in the following table. If your environment is different, modify the sample code and instructions as needed to match your configuration. User Password Root account root password Big SQL Administrator bigsql bigsql Ambari Administrator admin admin Property Value Host name myhost-master.fyre.ibm.com Ambari port number 8080 Big SQL database name bigsql Big SQL port number 32051 Big SQL installation directory /usr/ibmpacks/bigsql Big SQL JDBC driver /usr/ibmpacks/bigsql/4.3.0.0/db2/java/db2jcc4.jar JSqsh installation directory /usr/ibmpacks/common-utils/current/jsqsh Big SQL samples directory /usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data Spark version 2.1 Spark client /usr/iop/current/spark2-client About the screen captures, sample code, and environment configuration Screen captures in this lab depict examples and results that may vary from what you see when you complete the exercises. In addition, you may need to customize some code examples to match your environment.
  • 6. 6 1.4. Getting started To get started with the lab exercises, you need access to a working BigInsights environment, as described in the previous section. See https://www.ibm.com/analytics/us/en/technology/hadoop/hadoop- trials.html for download options. Product documentation for BigInsights, including installation instructions and sample exercises, is available at https://www.ibm.com/support/knowledgecenter/SSPT3X_4.3.0/com.ibm.swg.im.infosphere.biginsights.w elcome.doc/doc/welcome.html. Before continuing with this lab, verify that Big SQL, Spark, and all of their pre-requisite services are running. If you have any questions or need help getting your environment up and running, visit Hadoop Dev (https://developer.ibm.com/hadoop/) and review the product documentation or post a message to the forum. You cannot proceed with subsequent lab exercises without access to a working environment.
  • 7. 7 Lab 2 Using Spark to work with Big SQL tables With your BigInsights environment running Big SQL and Spark, you’re ready to explore how to access Big SQL data from the Spark shell. In this exercise, you will  Create and populate a few Big SQL tables  Use the Spark shell, Scala, and Spark SQL to query and join data in these tables  Invoke a Spark ML (machine learning) function over Big SQL data  Optionally, join data in JSON files with data in Big SQL tables through Spark SQL This lab presumes you know how to launch and use Big SQL’s command-line interface (JSqsh) to execute queries and commands. If necessary, consult the Getting Started with Big SQL lab (http://www.slideshare.net/CynthiaSaracco/big-sql40-hol) or the BigInsights Knowledge Center (https://www.ibm.com/support/knowledgecenter/SSPT3X_4.3.0/com.ibm.swg.im.infosphere.biginsights.w elcome.doc/doc/welcome.html) for details. This lab also presumes that you can access the Big SQL sample data on your local file system. The sample data ships with Big SQL and is often located in a directory such as /usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data. 2.1. Creating and populating Big SQL sample tables In this module, you will create and populate several Big SQL tables that you will access from Spark later. To illustrate the breadth of Big SQL data access available to Spark programmers, your Big SQL tables will employ different underlying storage managers. Specifically, your sample tables will store data in the Hive warehouse, in HBase, and in an arbitrary HDFS directory. This lab uses 3 of the more than 60 tables that comprise the sample Big SQL database, which employs a star schema design (FACT and DIMENSION tables) typical of a relational data warehouse to model sales data for various retail products. Unless otherwise indicated, examples presume you’re executing commands using the bigsql ID. If you’ve worked through other publicly available Big SQL labs, you may have already completed some of the necessary work included in this module. 2.2. Launch JSqsh and connect to your database __1. Launch JSqsh, the Big SQL command line interface. For example, in my environment, I entered /usr/ibmpacks/common-utils/current/jsqsh/bin/jsqsh __2. Connect to your Big SQL database. If necessary, launch the JSqsh connection wizard to create a connection. setup connections
  • 8. 8 2.3. Create Big SQL Hive tables __3. Create two tables in the Hive warehouse using your default schema (which will be "bigsql" if you connected into your database as that user). The first table is part of the PRODUCT dimension and includes information about product lines in different languages. The second table is the sales FACT table, which tracks transactions (orders) of various products. -- look up table with product line info in various languages CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup ( product_line_code INT NOT NULL , product_line_en VARCHAR(90) NOT NULL , product_line_de VARCHAR(90), product_line_fr VARCHAR(90) , product_line_ja VARCHAR(90), product_line_cs VARCHAR(90) , product_line_da VARCHAR(90), product_line_el VARCHAR(90) , product_line_es VARCHAR(90), product_line_fi VARCHAR(90) , product_line_hu VARCHAR(90), product_line_id VARCHAR(90) , product_line_it VARCHAR(90), product_line_ko VARCHAR(90) , product_line_ms VARCHAR(90), product_line_nl VARCHAR(90) , product_line_no VARCHAR(90), product_line_pl VARCHAR(90) , product_line_pt VARCHAR(90), product_line_ru VARCHAR(90) , product_line_sc VARCHAR(90), product_line_sv VARCHAR(90) , product_line_tc VARCHAR(90), product_line_th VARCHAR(90) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' LINES TERMINATED BY 'n' STORED AS TEXTFILE; -- fact table for sales CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact ( order_day_key INT NOT NULL , organization_key INT NOT NULL , employee_key INT NOT NULL , retailer_key INT NOT NULL , retailer_site_key INT NOT NULL , product_key INT NOT NULL , promotion_key INT NOT NULL , order_method_key INT NOT NULL , sales_order_key INT NOT NULL , ship_day_key INT NOT NULL , close_day_key INT NOT NULL , quantity INT , unit_cost DOUBLE , unit_price DOUBLE , unit_sale_price DOUBLE , gross_margin DOUBLE , sale_total DOUBLE , gross_profit DOUBLE ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' LINES TERMINATED BY 'n' STORED AS TEXTFILE ;
  • 9. 9 __4. Load data into each of these tables using sample files provided with Big SQL. Change the FILE URL specification in each of the following examples to match your environment. Then, one at a time, issue each LOAD statement and verify that the operation completed successfully. LOAD returns a warning message providing details on the number of rows loaded, etc. load hadoop using file url 'sftp://yourID:yourPassword@myhost- master.fyre.ibm.com:22/usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data/GOSALE SDW.SLS_PRODUCT_LINE_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='t') INTO TABLE SLS_PRODUCT_LINE_LOOKUP overwrite; load hadoop using file url 'sftp://yourID:yourPassword@myhost- master.fyre.ibm.com:22/usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data/GOSALE SDW.SLS_SALES_FACT.txt' with SOURCE PROPERTIES ('field.delimiter'='t') INTO TABLE SLS_SALES_FACT overwrite; __5. Query the tables to verify that the expected number of rows was loaded into each table. Execute each query below and compare the results with the number of rows specified in the comment line preceding each query. -- total rows in SLS_PRODUCT_LINE_LOOKUP = 5 select count(*) from bigsql.SLS_PRODUCT_LINE_LOOKUP; -- total rows in SLS_SALES_FACT = 446023 select count(*) from bigsql.SLS_SALES_FACT; __6. Open a terminal window. __7. Enable public access to the sample tables in your Hive warehouse. From the command line, issue this command to switch to the root user ID temporarily: su root When prompted, enter the password for this account. Then switch to the hdfs ID. su hdfs While logged in as user hdfs, issue this command to provide public access to all Hive warehouse tables: hdfs dfs -chmod -R 777 /apps/hive/warehouse 2.4. Create a Big SQL externally managed table Now that you have 2 Big SQL tables in the Hive warehouse, it's time to create an externally managed Big SQL table – i.e., a table created over a user directory that resides outside of the Hive warehouse. This user directory will contain the table’s data in files. Creating such a table effectively layers a SQL schema over existing HDFS data (or data that you may later upload into the target HDFS directory). __8. From your terminal window, check the directory permissions for HDFS.
  • 10. 10 hdfs dfs -ls / If the /user directory cannot be written by the public (as shown in the example above), you will need to change these permissions so that you can create the necessary subdirectories for this lab. While logged in as user hdfs, issue this command: hdfs dfs -chmod 777 /user Next, confirm the effect of your change: hdfs dfs -ls / Exit the hdfs user account: exit Finally, exit the root user account and return to the standard account you’ll be using for this lab (e.g., bigsql): exit __9. Create a directory structure in your distributed file system for the source data file for the product dimension table. (If desired, alter the HDFS information as appropriate for your environment.) hdfs dfs -mkdir /user/bigsql_spark_lab hdfs dfs -mkdir /user/bigsql_spark_lab/sls_product_dim __10. Upload the source data file (the Big SQL sample data file named GOSALESDW.SLS_PRODUCT_DIM.txt) into the target DFS directory. Change the local and DFS directories information below to match your environment.
  • 11. 11 hdfs dfs -copyFromLocal /usr/ibmpacks/bigsql/4.3.0.0/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt /user/bigsql_spark_lab/sls_product_dim/SLS_PRODUCT_DIM.txt __11. List the contents of the HDFS directory and verify that your sample data file is present. hdfs dfs -ls /user/bigsql_spark_lab/sls_product_dim __12. Ensure public access to this lab's directory structure. hdfs dfs -chmod -R 777 /user/bigsql_spark_lab __13. Return to your Big SQL query execution environment (JSqsh) and connect to your Big SQL database. __14. Create an external Big SQL table for the sales product dimension (extern.sls_product_dim). Note that the LOCATION clause references the DFS directory into which you copied the sample data. -- product dimension table CREATE EXTERNAL HADOOP TABLE IF NOT EXISTS extern.sls_product_dim ( product_key INT NOT NULL , product_line_code INT NOT NULL , product_type_key INT NOT NULL , product_type_code INT NOT NULL , product_number INT NOT NULL , base_product_key INT NOT NULL , base_product_number INT NOT NULL , product_color_code INT , product_size_code INT , product_brand_key INT NOT NULL , product_brand_code INT NOT NULL , product_image VARCHAR(60) , introduction_date TIMESTAMP , discontinued_date TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' LINES TERMINATED BY 'n' location '/user/bigsql_spark_lab/sls_product_dim';
  • 12. 12 If you encounter a SQL -5105 error message such as the one shown below, the HDFS directory permissions for your target directory (e.g., /user/bigsql_spark_lab) may be too restrictive. From an OS terminal window, issue this command: hdfs dfs -ls /user/bigsql_spark_lab Your permissions must include rw settings. Consult the earlier steps in this lab for instructions on how to reset HDFS directory permissions. __15. Verify that you can query the table. -- total rows in EXTERN.SLS_PRODUCT_DIM = 274 select count(*) from EXTERN.SLS_PRODUCT_DIM; 2.5. Create a Big SQL table in HBase Finally, create and populate a Big SQL table managed by HBase. Specifically, create a table that joins rows from one of your Big SQL Hive tables (bigsql.sls_product_line_lookup) with rows in your externally managed Big SQL table (extern.sls_product_dim). Your HBase table will effectively “flatten” (or de-normalize) content in these tables into a structure that’s more efficient for processing in HBase. __16. Create a Big SQL table named sls_product_flat managed by HBase and populate this table with the results of a query spanning two Big SQL tables that you created previously. CREATE hbase TABLE IF NOT EXISTS sls_product_flat ( product_key INT NOT NULL , product_line_code INT NOT NULL , product_type_key INT NOT NULL , product_type_code INT NOT NULL , product_line_en VARCHAR(90) , product_line_de VARCHAR(90) ) column mapping ( key mapped by (product_key), data:c2 mapped by (product_line_code), data:c3 mapped by (product_type_key), data:c4 mapped by (product_type_code), data:c5 mapped by (product_line_en), data:c6 mapped by (product_line_de) ) as select product_key, d.product_line_code, product_type_key, product_type_code, product_line_en, product_line_de from extern.sls_product_dim d, bigsql.sls_product_line_lookup l where d.product_line_code = l.product_line_code;
  • 13. 13 This statement creates a Big SQL table named SLS_PRODUCT_FLAT in the current user schema. The COLUMN MAPPING clause specifies how SQL columns are to be mapped to HBase columns in column families. For example, the SQL PRODUCT_KEY column maps to the HBase row key, while other SQL columns are mapped to various columns within the HBase data column family. For more details about Big SQL HBase support, including column mappings, see the separate lab on Working with HBase and Big SQL (https://www.slideshare.net/CynthiaSaracco/h-base-introlabv4) or consult the product documentation. __17. Verify that 274 rows are present in your Big SQL HBase table. select count(*) from bigsql.sls_product_flat; +-----+ | 1 | +-----+ | 274 | +-----+ __18. Optionally, query the table to become familiar with its contents. select product_key, product_line_code, product_line_en from bigsql.sls_product_flat where product_key > 30270; +-------------+-------------------+----------------------+ | PRODUCT_KEY | PRODUCT_LINE_CODE | PRODUCT_LINE_EN | +-------------+-------------------+----------------------+ | 30271 | 993 | Personal Accessories | | 30272 | 993 | Personal Accessories | | 30273 | 993 | Personal Accessories | | 30274 | 993 | Personal Accessories | +-------------+-------------------+----------------------+ 4 rows in results(first row: 0.106s; total: 0.108s) 2.6. Querying and manipulating Big SQL data through Spark Now that you’ve created and populated the sample Big SQL tables required for this lab, it’s time to experiment with accessing them. In this module, you’ll launch the Spark shell and issue Scala commands and expressions to retrieve Big SQL data. Specifically, you’ll use Spark SQL to query data in Big SQL tables. You’ll model the result sets from your queries as DataFrames, the Spark equivalent of a table. For details on DataFrames, visit the Spark web site.
  • 14. 14 2.7. Explore the basics In this module, you’ll learn how to use Spark’s support for JDBC data sources to access data in a single Big SQL table. __1. From a terminal window, launch the Spark shell using the –driver-class-path option to specify the location of the Big SQL JDBC driver class (db2jcc4.jar) in your environment. Adjust the directory information below for the Spark shell and Big SQL .jar file to match your environment. /usr/iop/current/spark2-client/bin/spark-shell --driver-class-path /usr/ibmpacks/bigsql/4.3.0.0/db2/java/db2jcc4.jar In this section and others that follow, highlighting designates sample output generated from commands you enter. (Commands are not highlighted and appear just before sample output.) Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://xxx.yy.zzz.164:4040 Spark context available as 'sc' (master = local[*], app id = local-1490640789034). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 2.1.0 /_/ Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_77) Type in expressions to have them evaluated. Type :help for more information. scala> __2. From the Spark shell, import classes that you’ll need for subsequent lab work. import org.apache.spark.sql.Row import org.apache.spark.sql.SparkSession scala> import org.apache.spark.sql.Row import org.apache.spark.sql.Row scala> import org.apache.spark.sql.SparkSession import org.apache.spark.sql.SparkSession
  • 15. 15 __3. Load data from your sls_product_line_lookup table (in Hive) into a DataFrame named lookupDF. Adjust the JDBC specification (option values) below as needed to match your environment. val lookupDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost- master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.sls_product_line_lookup" ).option("user", "bigsql").option("password", "bigsql").load() scala> val lookupDF = spark.read.format("jdbc").option("url", "jdbc:db2://xxxx.com:32051/BIGSQL").option("dbtable","bigsql.sls_product_line_lookup ").option("user", "bigsql").option("password", "bigsql").load() lookupDF: org.apache.spark.sql.DataFrame = [PRODUCT_LINE_CODE: int, PRODUCT_LINE_EN: string ... 22 more fields] As you’ll note, connection property values are hard-coded in this example to make it easy for you to follow. For production use, you may prefer a more flexible approach through the use of java.util.Properties; see the Spark API documentation for details. __4. Display the contents of your DataFrame. The number of columns in your data set, coupled with the length of many of the string columns, may make the output format of this show() operation difficult to read on your screen. lookupDF.show() scala> lookupDF.show() +-----------------+--------------------+--------------------+--------------------+---------- -----+--------------------+-------------------+--------------------+--------------------+--- -----------------+--------------------+--------------------+--------------------+----------- ----+--------------------+--------------------+--------------------+--------------------+--- -----------------+--------------------+---------------+--------------------+--------------- +--------------------+ |PRODUCT_LINE_CODE| PRODUCT_LINE_EN| PRODUCT_LINE_DE| PRODUCT_LINE_FR|PRODUCT_LINE_JA| PRODUCT_LINE_CS| PRODUCT_LINE_DA| PRODUCT_LINE_EL| PRODUCT_LINE_ES| PRODUCT_LINE_FI| PRODUCT_LINE_HU| PRODUCT_LINE_ID| PRODUCT_LINE_IT|PRODUCT_LINE_KO| PRODUCT_LINE_MS| PRODUCT_LINE_NL| PRODUCT_LINE_NO| PRODUCT_LINE_PL| PRODUCT_LINE_PT| PRODUCT_LINE_RU|PRODUCT_LINE_SC| PRODUCT_LINE_SV|PRODUCT_LINE_TC| PRODUCT_LINE_TH| +-----------------+--------------------+--------------------+--------------------+---------- -----+--------------------+-------------------+--------------------+--------------------+--- -----------------+--------------------+--------------------+--------------------+----------- ----+--------------------+--------------------+--------------------+--------------------+--- -----------------+--------------------+---------------+--------------------+--------------- +--------------------+ | 991| Camping Equipment| Campingausrüstung| Matériel de camping| キャンプ用品|Vybavení pro kemp...| Campingudstyr|Εξοπλισμός κατασκ...| Equipo de acampada| Retkeilyvarusteet| Kempingfelszerelés|Perlengkapan Berk...|Attrezzatura per ...| 캠핑 장비|Kelengkapan Berkh...|Kampeerbenodigdheden| Campingutstyr|Ekwipunek kempingowy|Equipamento acamp...|Снаряжение для ту...|
  • 16. 16 露营装备| Campingutrustning| 露營器材| อุปกรณ์ตั้งแคมป์ | | 992|Mountaineering Eq...|Bergsteigerausrüs...|Matériel de montagne| 登山用品|Horolezecké vybavení| Alpint udstyr|Εξοπλισμός ορειβα...|Equipo de montañismo|Vuorikiipeilyvaru...|Hegymászó-felszer...|Perlengkapan Pend...|Attrezzatura per ...| 등산 장비|Kelengkapan Menda...| Bergsportartikelen| Klatreutstyr| Sprzęt wspinaczkowy|Equipamento monta...| Горное снаряжение| 登山装备| Klätterutrustning| 登山器材| อุปกรณ์ปีนเขา| | 993|Personal Accessories| Accessoires|Accessoires perso...| 個人装備| Věci osobní potřeby|Personligt tilbehør| Προσωπικά είδη|Accesorios person...|Henkilökohtaiset ...|Személyes kiegész...| Aksesori pribadi| Accessori personali| 개인 용품| Aksesori Diri|Persoonlijke acce...|Personlig utrustning| Akcesoria osobiste| Acessórios pessoais|Личные принадлежн...| 个人附件|Personliga tillbehör| 個人配件| อุปกรณ์ส่วนตัว| | 994| Outdoor Protection|Outdoor-Schutzaus...|Articles de prote...| アウトドア用保護用品| Vybavení do přírody|Udendørsbeskyttelse|Προστασία για την...|Protección aire l...| Ulkoiluvarusteet| Védőfelszerelés|Perlindungan Luar...|Protezione personale| 야외 보호 장비|Perlindungan Luar...|Buitensport - pre...|Utendørs beskyttelse|Wyposażenie ochronne| Proteção ar livre| Средства защиты| 户外防护用品| Skyddsartiklar| 戶外防護器材|สิ่งป้องกันเมื่ออ...| | 995| Golf Equipment| Golfausrüstung| Matériel de golf| ゴルフ用品| Golfové potřeby| Golfudstyr| Εξοπλισμός γκολφ| Equipo de golf| Golf-varusteet| Golffelszerelés| Perlengkapan Golf|Attrezzatura da golf| 골프 장비| Kelengkapan Golf| Golfartikelen| Golfutstyr| Ekwipunek golfowy| Equipamento golfe|Снаряжение для го...| 高尔夫球装备| Golfutrustning| 高爾夫球器材| อุปกรณ์กอล์ฟ| +-----------------+--------------------+--------------------+--------------------+---------- -----+--------------------+-------------------+--------------------+--------------------+--- -----------------+--------------------+--------------------+--------------------+----------- ----+--------------------+--------------------+--------------------+--------------------+--- -----------------+--------------------+---------------+--------------------+--------------- +--------------------+ __5. If desired, experiment with other operations you can perform on DataFrames to inspect their contents, consulting the Spark online documentation as needed. The example below collects each row in the DataFrame into an Array and prints the contents of each array element. lookupDF.collect().foreach(println) scala> lookupDF.collect().foreach(println) [991,Camping Equipment,Campingausrüstung,Matériel de camping,キャンプ用品,Vybavení pro kempování,Campingudstyr,Εξοπλισμός κατασκήνωσης,Equipo de acampada,Retkeilyvarusteet,Kempingfelszerelés,Perlengkapan Berkemah,Attrezzatura per campeggio,캠핑 장비,Kelengkapan Berkhemah,Kampeerbenodigdheden,Campingutstyr,Ekwipunek kempingowy,Equipamento acampamento,Снаряжение для туризма,露营装备,Campingutrustning,露營器材,อุปกรณ์ตั้งแคมป์ ] [992,Mountaineering Equipment,Bergsteigerausrüstung,Matériel de montagne,登山用品,Horolezecké vybavení,Alpint udstyr,Εξοπλισμός ορειβασίας,Equipo de
  • 17. 17 montañismo,Vuorikiipeilyvarusteet,Hegymászó-felszerelés,Perlengkapan Pendaki Gunung,Attrezzatura per alpinismo,등산 장비,Kelengkapan Mendaki Gunung,Bergsportartikelen,Klatreutstyr,Sprzęt wspinaczkowy,Equipamento montanhismo,Горное снаряжение,登山装备,Klätterutrustning,登山器材,อุปกรณ์ปีนเขา] [993,Personal Accessories,Accessoires,Accessoires personnels,個人装備,Věci osobní potřeby,Personligt tilbehør,Προσωπικά είδη,Accesorios personales,Henkilökohtaiset tarvikkeet,Személyes kiegészítők,Aksesori pribadi,Accessori personali,개인 용품,Aksesori Diri,Persoonlijke accessoires,Personlig utrustning,Akcesoria osobiste,Acessórios pessoais,Личные принадлежности,个人附件,Personliga tillbehör,個人配件,อุปกรณ์ส่วนตัว] [994,Outdoor Protection,Outdoor-Schutzausrüstung,Articles de protection,アウトドア用保護用品,Vybavení do přírody,Udendørsbeskyttelse,Προστασία για την ύπαιθρο,Protección aire libre,Ulkoiluvarusteet,Védőfelszerelés,Perlindungan Luar Ruang,Protezione personale,야외 보호 장비,Perlindungan Luar Bangunan,Buitensport - preventie,Utendørs beskyttelse,Wyposażenie ochronne,Proteção ar livre,Средства защиты,户外防护用品,Skyddsartiklar,戶外防護器材,สิ่งป้องกันเมื่ออยู่กลางแจ ้ง] [995,Golf Equipment,Golfausrüstung,Matériel de golf,ゴルフ用品,Golfové potřeby,Golfudstyr,Εξοπλισμός γκολφ,Equipo de golf,Golf- varusteet,Golffelszerelés,Perlengkapan Golf,Attrezzatura da golf,골프 장비,Kelengkapan Golf,Golfartikelen,Golfutstyr,Ekwipunek golfowy,Equipamento golfe,Снаряжение для гольфа,高尔夫球装备,Golfutrustning,高爾夫球器材,อุปกรณ์กอล์ฟ] __6. To query the DataFrame, create a temporary view of it. lookupDF.createOrReplaceTempView("lookup") __7. Use Spark SQL to query your temporary view and display the results. The following example returns English and French product line information for all product line codes below 995. spark.sql("select product_line_code, product_line_en, product_line_fr from lookup where product_line_code < 995").show() scala> spark.sql("select product_line_code, product_line_en, product_line_fr from lookup where product_line_code < 995").show() +-----------------+--------------------+--------------------+ |product_line_code| product_line_en| product_line_fr| +-----------------+--------------------+--------------------+ | 991| Camping Equipment| Matériel de camping| | 992|Mountaineering Eq...|Matériel de montagne| | 993|Personal Accessories|Accessoires perso...| | 994| Outdoor Protection|Articles de prote...| +-----------------+--------------------+--------------------+ Note that string values in the last two columns of this result were truncated due to defaults associated with show().
  • 18. 18 __8. If desired, modify the previous command slightly to display more content in the final two columns of your result. For example, use show(5,100) to specify that a maximum of 5 rows should be returned and that string column values should be truncated after 100 characters. spark.sql("select product_line_code, product_line_en, product_line_fr from lookup where product_line_code < 995").show(5,100) scala> spark.sql("select product_line_code, product_line_en, product_line_fr from lookup where product_line_code < 995").show(5,100) +-----------------+------------------------+----------------------+ |product_line_code| product_line_en| product_line_fr| +-----------------+------------------------+----------------------+ | 991| Camping Equipment| Matériel de camping| | 992|Mountaineering Equipment| Matériel de montagne| | 993| Personal Accessories|Accessoires personnels| | 994| Outdoor Protection|Articles de protection| +-----------------+------------------------+----------------------+ 2.8. Join data from multiple tables In this module, you’ll learn how to use Spark to query data from Big SQL tables managed outside of the Hive warehouse. In doing so, you’ll see that the underlying storage mechanism that Big SQL uses for its tables is hidden from the Spark programmer. In other words, you don’t need to know how Big SQL is storing the data – you simply query Big SQL tables as you would any other JDBC data source. __9. Following the same approach that you executed in the previous module, create a DataFrame for your Big SQL extern.sls_product_dim table. Adjust the specifications (option values) below as needed to match your environment. val dimDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost- master.fyre.ibm.com:32051/BIGSQL").option("dbtable","extern.sls_product_dim").option ("user", "bigsql").option("password", "bigsql").load() __10. Create a temporary view so that you can query the data. dimDF.createOrReplaceTempView("dim") __11. Query the data set, making sure that there are 274 rows. spark.sql("select count(*) from dim").show() scala> spark.sql("select count(*) from dim").show() +--------+ |count(1)| +--------+ | 274| +--------+ __12. Join data in your lookup and dim views and display the results. Recall that lookup loaded data from a Big SQL Hive table while dim loaded data from a Big SQL externally managed table.
  • 19. 19 spark.sql("select l.product_line_code, l.product_line_en, d.product_number, d.introduction_date from lookup l, dim d where l.product_line_code=d.product_line_code limit 15").show() scala> spark.sql("select l.product_line_code, l.product_line_en, d.product_number, d.introduction_date from lookup l, dim d where l.product_line_code=d.product_line_code limit 15").show() +-----------------+---------------+--------------+--------------------+ |product_line_code|product_line_en|product_number| introduction_date| +-----------------+---------------+--------------+--------------------+ | 995| Golf Equipment| 101110|2003-12-15 00:00:...| | 995| Golf Equipment| 102110|2003-12-10 00:00:...| | 995| Golf Equipment| 103110|2003-12-10 00:00:...| | 995| Golf Equipment| 104110|2003-12-18 00:00:...| | 995| Golf Equipment| 105110|2003-12-27 00:00:...| | 995| Golf Equipment| 106110|2003-12-05 00:00:...| | 995| Golf Equipment| 107110|2004-01-13 00:00:...| | 995| Golf Equipment| 108110|2003-12-27 00:00:...| | 995| Golf Equipment| 109110|2003-12-10 00:00:...| | 995| Golf Equipment| 110110|2003-12-10 00:00:...| | 995| Golf Equipment| 111110|2003-12-15 00:00:...| | 995| Golf Equipment| 112110|2004-01-10 00:00:...| | 995| Golf Equipment| 113110|2004-01-15 00:00:...| | 995| Golf Equipment| 114110|2003-12-15 00:00:...| | 995| Golf Equipment| 115110|2003-12-27 00:00:...| +-----------------+---------------+--------------+--------------------+ __13. Next, explore how to query a Big SQL HBase table. Create a DataFrame for your sls_product_fact table. Adjust the specification (option values) below as needed to match your environment. val hbaseDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost- master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.sls_product_flat").optio n("user", "bigsql").option("password", "bigsql").load() __14. Create a temporary view of your DataFrame. hbaseDF.createOrReplaceTempView("flat")
20

__15. Query this view.

spark.sql("select * from flat where product_key > 30270").show()

scala> spark.sql("select * from flat where product_key > 30270").show()
+-----------+-----------------+----------------+-----------------+--------------------+---------------+
|PRODUCT_KEY|PRODUCT_LINE_CODE|PRODUCT_TYPE_KEY|PRODUCT_TYPE_CODE|     PRODUCT_LINE_EN|PRODUCT_LINE_DE|
+-----------+-----------------+----------------+-----------------+--------------------+---------------+
|      30271|              993|             960|              960|Personal Accessories|    Accessoires|
|      30272|              993|             960|              960|Personal Accessories|    Accessoires|
|      30273|              993|             960|              960|Personal Accessories|    Accessoires|
|      30274|              993|             960|              960|Personal Accessories|    Accessoires|
+-----------+-----------------+----------------+-----------------+--------------------+---------------+

__16. To prepare to join data from your Big SQL HBase table with data from a Big SQL Hive table, create a DataFrame for the sales fact table. Adjust the specification (option values) below as needed to match your environment.

val factDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.sls_sales_fact").option("user", "bigsql").option("password", "bigsql").load()

__17. Create a view of this DataFrame.

factDF.createOrReplaceTempView("fact")

__18. Join data in your temporary fact and flat views. These views are based on the Big SQL sales fact table (sls_sales_fact) managed by Hive and the Big SQL dimension table (sls_product_flat) managed by HBase.

spark.sql("select f.retailer_key, f.product_key, h.product_line_en, f.sale_total from fact f, flat h where f.product_key = h.product_key and sale_total > 50000 limit 10").show()
21

scala> spark.sql("select f.retailer_key, f.product_key, h.product_line_en, f.sale_total from fact f, flat h where f.product_key = h.product_key and sale_total > 50000 limit 10").show()
+------------+-----------+---------------+----------+
|retailer_key|product_key|product_line_en|sale_total|
+------------+-----------+---------------+----------+
|        6870|      30128| Golf Equipment|  60812.38|
|        6875|      30128| Golf Equipment|  79123.29|
|        6872|      30128| Golf Equipment|  61316.35|
|        6875|      30128| Golf Equipment|  79291.28|
|        6875|      30128| Golf Equipment|   77275.4|
|        6875|      30128| Golf Equipment|  65348.11|
|        6871|      30128| Golf Equipment|  51740.92|
|        6875|      30128| Golf Equipment|  71059.77|
|        6875|      30128| Golf Equipment|  79627.26|
|        7154|      30128| Golf Equipment|  58460.52|
+------------+-----------+---------------+----------+

2.9. Use Spark MLlib to work with Big SQL data

Now that you understand how to use Spark SQL to work with Big SQL tables, let's explore how to use other Spark technologies to manipulate Big SQL data. In this exercise, you'll use a simple function in Spark's machine learning library (MLlib) to transform a DataFrame built from data in one or more Big SQL tables. Quite often, such transformations are needed before using more sophisticated analytical functions available through MLlib and other libraries.

__19. Import the Spark MLlib classes that you'll be using shortly.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

__20. Create a DataFrame from a new query that involves the Spark SQL temporary view you created earlier for the Big SQL sales fact table. Your new DataFrame will serve as input to a subsequent transformation operation.

val test=sql("select quantity, unit_cost, sale_total, gross_profit from fact")

__21. Create a VectorAssembler that combines data in a list of columns into a single vector column. In this example, two columns (quantity and unit_cost) will be transformed into a new vector column named data.

val assembler = new VectorAssembler().setInputCols(Array("quantity","unit_cost")).setOutputCol("data")

__22. Execute the transformation operation and display the results. Note that this form of show() displays only the top 20 records of the results.

val output = assembler.transform(test).show()
  • 22. 22 scala> val output = assembler.transform(test).show() +--------+------------------+----------+------------+--------------------+ |quantity| unit_cost|sale_total|gross_profit| data| +--------+------------------+----------+------------+--------------------+ | 587| 34.9| 41958.76| 21472.46| [587.0,34.9]| | 214| 91.8| 35949.86| 16304.66| [214.0,91.8]| | 576| 2.67| 5921.28| 4383.36| [576.0,2.67]| | 129| 56.23| 11570.01| 4316.34| [129.0,56.23]| | 1776|1.8599999999999999| 10123.2| 6819.84|[1776.0,1.8599999...| | 1822| 1.79| 8654.5| 5393.12| [1822.0,1.79]| | 412| 9.1| 9191.72| 5442.52| [412.0,9.1]| | 67| 690.0| 79200.7| 32970.7| [67.0,690.0]| | 97| 238.88| 41543.16| 18371.8| [97.0,238.88]| | 1172| 6.62| 10278.44| 2519.8| [1172.0,6.62]| | 591| 34.97| 30838.38| 10171.11| [591.0,34.97]| | 338| 85.11| 40776.32| 12009.14| [338.0,85.11]| | 97| 426.0| 61075.08| 19753.08| [97.0,426.0]| | 364| 86.0| 49704.2| 18400.2| [364.0,86.0]| | 234| 65.25| 22737.78| 7469.28| [234.0,65.25]| | 603| 16.0| 19446.75| 9798.75| [603.0,16.0]| | 232| 18.0| 6625.92| 2449.92| [232.0,18.0]| | 450| 18.05| 15075.0| 6952.5| [450.0,18.05]| | 257| 20.0| 7864.2| 2724.2| [257.0,20.0]| | 191| 15.62| 6434.79| 3451.37| [191.0,15.62]| +--------+------------------+----------+------------+--------------------+ only showing top 20 rows 2.10. Optional: Join Big SQL and JSON data In this module, you’ll explore how to use Spark to query JSON data and join that data with data from a Big SQL table. Spark SQL enables programmers to query data in a variety of popular formats without manually defining tables or mapping schemas to tables. Since JSON is a very popular data format, you may find it convenient to use Spark SQL’s built-in JSON support to query data in JSON files. And since Big SQL is a popular target for managing “cold” corporate data extracted from a relational data warehouse, you may occasionally need to join JSON and Big SQL data. To begin, collect some JSON data. Examples in this lab are based on 10-day weather forecasts generated by The Weather Company’s limited-use free service on Bluemix. For details about this service, log into Bluemix at http://bluemix.net and visit https://console.ng.bluemix.net/catalog/services/weather-company-data?taxonomyNavigation=apps. If using the Bluemix weather service to generate JSON data for your work, follow the instructions below. If you wish to use different JSON data, skip the remainder of this section. __1. Log into Bluemix (or register for a free account, if needed). __2. Search the catalog for “weather” services.
  • 23. 23 __3. Once you locate The Weather Company’s service, follow the standard Bluemix procedure to create an instance of this service. Consult Bluemix online documentation, if needed, to perform this task. __4. After creating your weather service, access its APIs. Consult Bluemix online documentation, if needed, to perform this task. __5. From the listing of weather service APIs, select the Daily Forecast API. __6. Click the service for the 10-day daily forecast by geocode.
  • 24. 24 __7. Scroll through the displayed pages to become familiar with the details of this forecast service. Note that you can customize input parameters to control the location of the forecast, the units of measure for the data (e.g., metric) and the language of text data (e.g., English). Accept all default values and proceed to the bottom of the page. Locate and click the Try it out! button. __8. When prompted for a username and password, enter the information supplied for your service’s credentials.
  • 25. 25 The appropriate user name and password are included in the service credentials section of your service’s main page. Do not enter your Bluemix ID and password; these will be rejected. You must enter the username and password that were generated for your service when it was created. If necessary, return to the main menu for your service and click on the Services Credentials link to expose this information. Consult Bluemix documentation for details. __9. Inspect the results; a subset is shown here:
26

__10. Review the structure of the JSON data returned by this service, noting that it contains multiple levels of nesting. Top-level objects represent metadata (such as the language used in subsequent forecasts, the longitude and latitude of the location to which the forecast applies, etc.) and weather forecasts. The forecasts element is an array of JSON objects that detail the minimum and maximum temperatures, the local time for which the forecast is valid, and so on. Also included in each forecast array element are separate night and day forecasts, which contain further details. Keep the structure of this JSON data in mind, as it dictates the syntax of your queries in subsequent exercises.

__11. Note that the weather data returned in the Response Body section of this web page splits information across multiple lines. You need to store the data without carriage returns or line feeds. To do so, copy the URL displayed in the Request URL section of the web page and paste it into a new tab in your browser. (Writing an application that calls the REST APIs is more appropriate for production. However, I wanted to give you a quick way to collect some data. A scripted alternative is sketched after the next step.)

__12. Inspect the results, noting the change in displayed format.
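If you'd prefer to script the download described in step __11 rather than copy the result from your browser, the following is a minimal sketch of one way to do so from a Scala shell. It is not part of the original lab: the request URL shown is only a placeholder, so substitute the Request URL that the Daily Forecast API page displays for your geocode, and replace SERVICE_USERNAME / SERVICE_PASSWORD with the credentials generated for your own weather service instance.

import java.net.{URL, HttpURLConnection}
import java.util.Base64
import java.io.PrintWriter
import scala.io.Source

// Placeholder values -- substitute the Request URL and the credentials shown for YOUR service instance.
val requestUrl = "https://twcservice.mybluemix.net/api/weather/v1/geocode/37.17/-121.75/forecast/daily/10day.json?units=m&language=en-US"
val credentials = Base64.getEncoder.encodeToString("SERVICE_USERNAME:SERVICE_PASSWORD".getBytes("UTF-8"))

// The service uses HTTP basic authentication.
val conn = new URL(requestUrl).openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestProperty("Authorization", "Basic " + credentials)

// Read the response and strip line breaks so that each 10-day forecast is stored as a single JSON record.
val json = Source.fromInputStream(conn.getInputStream).mkString.replaceAll("[\r\n]+", "")
new PrintWriter("weather10day.json") { write(json); close() }

If you use this scripted route, you can skip the copy-and-paste in step __13 and go straight to transferring the resulting file to your cluster in step __15.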
27

__13. Copy the contents of this data and paste it into a file on your local file system.

__14. Optionally, repeat the process for another day or alter the geocode input parameter (longitude, latitude) to collect data about a different location. Store the results in a different file so that you will have at least 2 different 10-day forecasts to store in BigInsights. The weather API service includes a Parameters section that you can use to alter the geocode.

__15. Use FTP or SFTP to transfer your weather data file(s) to the local file system of your BigInsights cluster.

__16. Open a terminal window for your BigInsights cluster. Issue HDFS shell commands to create a subdirectory within HDFS for test purposes. I created a /user/saracco/weather subdirectory, as shown below. Modify the commands as needed for your environment.

hdfs dfs -mkdir /user/saracco
hdfs dfs -mkdir /user/saracco/weather

__17. Copy the file(s) from your local directory to your new HDFS subdirectory. Adjust this command as needed for your environment:

hdfs dfs -copyFromLocal weather*.* /user/saracco/weather

__18. Change permissions on your subdirectory and its contents. For example:

hdfs dfs -chmod -R 777 /user/saracco/weather

__19. List the contents of your HDFS subdirectory to validate your work, verifying that the files you uploaded are present and that permissions provide for global access.

hdfs dfs -ls /user/saracco/weather
28

Found 2 items
-rwxrwxrwx   3 saracco bihdfs      30677 2017-04-10 20:56 /user/saracco/weather/weather10dayApr4.json
-rwxrwxrwx   3 saracco bihdfs      29655 2017-04-10 20:56 /user/saracco/weather/weather10dayFeb17.json

After you've collected some JSON data and uploaded it to HDFS, create and populate a suitable Big SQL table to be joined with the JSON data. In this section, you'll create a geolookup table that contains longitude, latitude, and location information. The longitude and latitude columns will serve as join keys in a subsequent exercise. If you plan to use different JSON data for your work, modify the table definition and INSERT statements as needed to create a suitable table.

__20. Launch JSqsh and connect to your Big SQL database.

__21. Create a Big SQL table.

create hadoop table geolookup(longitude decimal(10,7), latitude decimal(10,7), location varchar(30));

__22. Insert rows into your Big SQL table. For the join key columns, ensure that at least some rows contain data that will match your JSON data. For example, if you're using weather forecasts generated by the Bluemix service, inspect the contents of your JSON weather data to determine appropriate data values for longitude and latitude, taking care to include the proper precision. This example inserts 3 rows into the geolookup table; 2 of these rows contain longitude / latitude data that matches the JSON weather forecast data I collected earlier.

insert into geolookup values (84.5, 37.17, 'Xinjiang, China');
insert into geolookup values (-121.75, 37.17, 'San Jose, CA USA');
insert into geolookup values (-73.990246,40.730171,'IBM Astor Place, NYC USA');

__23. Exit JSqsh.

Now that you have sample JSON data in HDFS and a suitable Big SQL table, it's time to query your data. Let's start with the JSON data. If you're not using 10-day weather forecasts for your sample JSON data, you'll need to modify some of the instructions below to match your data.

__24. If you don't already have an open window with the Spark shell launched, launch it now. Remember to include the appropriate Big SQL JDBC driver information at launch. Adjust the specification below as needed to match your environment.

spark-shell --driver-class-path /usr/ibmpacks/bigsql/4.3.0.0/db2/java/db2jcc4.jar

__25. From the Spark shell, define a path that points to your JSON data set. Adjust the specification below as needed to match your environment.

val path = "/user/saracco/weather"

scala> val path = "/user/saracco/weather"
path: String = /user/saracco/weather
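A note on why step __11 stripped out the line breaks: by default, spark.read.json expects each JSON record to occupy a single line of its input files. If your cluster runs Spark 2.2 or later (an assumption; check your environment), you could alternatively keep pretty-printed, multi-line JSON files and enable the multiLine option, as in this sketch:

// Sketch only: read pretty-printed (multi-line) JSON, assuming a Spark version (2.2+) that supports the multiLine option.
// With the single-line files prepared earlier in this lab, the default reader used in the next step is all you need.
val prettyWeatherDF = spark.read.option("multiLine", true).json(path)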
29

__26. Define a new DataFrame that will read JSON data from this path.

val weatherDF = spark.read.json(path)

scala> val weatherDF = spark.read.json(path)
17/04/03 11:40:59 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
weatherDF: org.apache.spark.sql.DataFrame = [forecasts: array<struct<blurb:string,blurb_author:string,class:string,day:struct<accumulation_phrase:string,alt_daypart_name:string,clds:bigint,day_ind:string,daypart_name:string,fcst_valid:bigint,fcst_valid_local:string,golf_category:string,golf_index:bigint,hi:bigint,icon_code:bigint,icon_extd:bigint,log_daypart_name:string,narrative:string,num:bigint,phrase_12char:string,phrase_22char:string,phrase_32char:string,pop:bigint,pop_phrase:string,precip_type:string,qpf:double,qualifier:string,qualifier_code:string,... 24 more fields>,dow:string,expire_time_gmt:bigint,fcst_valid:bigint,fcst_valid_local:string,lunar_phase:string,lunar_phase_code:string,lunar_phase_day:bigint,max_temp:bigint,min_temp:bigint,moonrise:string,moonset:string,narrative...

__27. Optionally, print the schema of your JSON data set so you can visualize its structure. (The screen capture below displays only a portion of the output.)

weatherDF.printSchema()

scala> weatherDF.printSchema()
root
 |-- forecasts: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- blurb: string (nullable = true)
 |    |    |-- blurb_author: string (nullable = true)
 |    |    |-- class: string (nullable = true)
 |    |    |-- day: struct (nullable = true)
 |    |    |    |-- accumulation_phrase: string (nullable = true)
 |    |    |    |-- alt_daypart_name: string (nullable = true)
 |    |    |    |-- clds: long (nullable = true)
 |    |    |    |-- day_ind: string (nullable = true)
 |    |    |    |-- daypart_name: string (nullable = true)
. . . .

__28. Create a temporary view of this data.

weatherDF.createOrReplaceTempView("weather")

__29. Create a new DataFrame to hold the results of a query over this view. The example below extracts the transaction ID from the metadata associated with each JSON record. (My test scenario contains 2 JSON records.)

val idDF = spark.sql("select metadata.transaction_id from weather")
30

Apart from the JSON path information that specifies the data of interest within each record, there's nothing different about this SQL query from any other Spark SQL query.

__30. Display the results.

idDF.show(2,100)

scala> idDF.show(2,100)
+------------------------+
|          transaction_id|
+------------------------+
|1459805077803:1340092112|
| 1455739258736:810296662|
+------------------------+

__31. Execute a more selective query that accesses deeply nested JSON objects. This example retrieves the longitude and latitude, the valid date/time for the night's forecast for the first element in the forecast array (forecasts[0].night.fcst_valid_local), and the short description of that night's forecast (forecasts[0].night.shortcast).

sql("select metadata.longitude, metadata.latitude, forecasts[0].night.fcst_valid_local, forecasts[0].night.shortcast from weather").show()

scala> sql("select metadata.longitude, metadata.latitude, forecasts[0].night.fcst_valid_local, forecasts[0].night.shortcast from weather").show()
+---------+--------+-----------------------------------+----------------------------+
|longitude|latitude|forecasts[0].night.fcst_valid_local|forecasts[0].night.shortcast|
+---------+--------+-----------------------------------+----------------------------+
|  -121.75|   37.17|               2016-04-04T19:00:...|                Mainly clear|
|     84.5|   37.17|               2016-02-17T19:00:...|               Partly cloudy|
+---------+--------+-----------------------------------+----------------------------+

Next, query your Big SQL data using Spark. This approach should be very familiar to you by now.

__32. Create a DataFrame for the Big SQL table (geolookup) that you intend to join with the JSON data.

val geoDF = spark.read.format("jdbc").option("url", "jdbc:db2://myhost-master.fyre.ibm.com:32051/BIGSQL").option("dbtable","bigsql.geolookup").option("user", "bigsql").option("password", "bigsql").load()

__33. Create a temporary view.

geoDF.createOrReplaceTempView("geolookup")

__34. Verify that you can query this view. Given how I populated the geolookup table earlier, the correct result from this query should be 3.
  • 31. 31 spark.sql("select count(*) from geolookup").show() __35. Optionally, display the contents of geolookup. spark.sql("select * from geolookup").show(5,100) scala> spark.sql("select * from geolookup").show(5,100) +------------+----------+------------------------+ | LONGITUDE| LATITUDE| LOCATION| +------------+----------+------------------------+ |-121.7500000|37.1700000| San Jose, CA USA| | -73.9902460|40.7301710|IBM Astor Place, NYC USA| | 84.5000000|37.1700000| Xinjiang, China| +------------+----------+------------------------+ __36. Join the JSON weather forecast data with your Big SQL data. The first 3 columns of this query’s result set are derived from the JSON data, while the final column’s data is derived from the Big SQL table. As you’ll note, longitude and latitude data in both source data sets serve as the join keys. spark.sql("select w.metadata.longitude, w.metadata.latitude, w.forecasts[0].narrative, g.location from weather w, geolookup g where w.metadata.longitude=g.longitude and metadata.latitude=g.latitude").show() scala> spark.sql("select w.metadata.longitude, w.metadata.latitude, w.forecasts[0].narrative, g.location from weather w, geolookup g where w.metadata.longitude=g.longitude and metadata.latitude=g.latitude").show() +---------+--------+----------------------+----------------+ |longitude|latitude|forecasts[0].narrative| location| +---------+--------+----------------------+----------------+ | 84.5| 37.17| Partly cloudy. Lo...| Xinjiang, China| | -121.75| 37.17| Abundant sunshine...|San Jose, CA USA| +---------+--------+----------------------+----------------+
32

Lab 3 Summary

In this lab, you explored one way of using Spark to work with data in Big SQL tables stored in the Hive warehouse, in HBase, and in an arbitrary HDFS directory. Through Spark SQL's support for JDBC data sources, you queried these tables and even invoked a simple Spark MLlib transformation operation against data in one of these tables. Finally, if you completed the optional exercise, you saw how easy it can be to join data in JSON files with data in Big SQL tables using Spark SQL.

To expand your skills and learn more, enroll in free online courses offered by Big Data University (http://www.bigdatauniversity.com/) or work through free tutorials included in the BigInsights product documentation. The HadoopDev web site (https://developer.ibm.com/hadoop/) contains links to these and other resources.
  • 33. 33 © Copyright IBM Corporation 2017. Written by C. M. Saracco. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. This information is based on current IBM product plans and strategy, which are subject to change by IBM without notice. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.