Learning spark ch09 - Spark SQL
1. CHAPTER 09: SPARK SQL
Learning Spark
by Holden Karau et al.
2. Overview: SPARK SQL
Linking with Spark SQL
Using Spark SQL in Applications
Initializing Spark SQL
Basic Query Example
SchemaRDDs
Caching
Loading and Saving Data
Apache Hive
Parquet
JSON
From RDDs
JDBC/ODBC Server
Working with Beeline
Long-Lived Tables and Queries
User-Defined Functions
Spark SQL UDFs
Hive UDFs
Spark SQL Performance
Performance Tuning Options
3. 9.1. Linking with Spark SQL
Spark SQL can be built with or without Apache Hive, the Hadoop
SQL engine. Spark SQL with Hive support allows us to access Hive
tables, UDFs (user-defined functions), SerDes (serialization and
deserialization formats), and the Hive query language (HiveQL).
It is important to note that including
the Hive libraries does not require an existing Hive installation. In
general, it is best to build Spark SQL with Hive support to access
these features. If you download Spark in binary form, it should
already be built with Hive support.
4. 9.2. Using Spark SQL in Applications
The most powerful way to use Spark SQL is inside a Spark
application. This gives us the power to easily load data and query it
with SQL while simultaneously combining it with “regular” program
code in Python, Java, or Scala.
To use Spark SQL this way, we construct a HiveContext (or
SQLContext for those wanting a stripped-down version) based on
our SparkContext. This context provides additional functions for
querying and interacting with Spark SQL data. Using the
HiveContext, we can build SchemaRDDs, which represent our
structured data, and operate on them with SQL or with normal RDD
operations like map().
5. Edx and Coursera Courses
Introduction to Big Data with Apache Spark
Spark Fundamentals I
Functional Programming Principles in Scala
6. 9.2.1 Initializing Spark SQL
Scala SQL imports
// Import Spark SQL
import org.apache.spark.sql.hive.HiveContext
// Or if you can't have the hive dependencies
import org.apache.spark.sql.SQLContext
Scala SQL implicits
// Create a Spark SQL HiveContext
val hiveCtx = ...
// Import the implicit conversions
import hiveCtx._
Constructing a SQL context in Scala
val sc = new SparkContext(...)
val hiveCtx = new HiveContext(sc)
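If the Hive dependencies are not available, a stripped-down context can be built the same way; a minimal sketch using the SQLContext import shown above:
// Construct a SQLContext when Hive support is not compiled in
val sqlCtx = new SQLContext(sc)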
7. 9.2.2 Basic Query Example
To make a query against a table, we call the sql() method on
the HiveContext or SQLContext. The first thing we need to
do is tell Spark SQL about some data to query. In this case
we will load some Twitter data from JSON, and give it a
name by registering it as a “temporary table” so we can
query it with SQL.
Basic Query Example
Loading and querying tweets in Scala
val input = hiveCtx.jsonFile(inputFile)
// Register the input schema RDD
input.registerTempTable("tweets")
// Select tweets based on the retweetCount
val topTweets = hiveCtx.sql("SELECT text, retweetCount FROM
tweets ORDER BY retweetCount LIMIT 10")
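To inspect the result in the driver, the returned SchemaRDD can be collected like any other RDD; a minimal sketch:
// Bring the top tweets back to the driver and print each Row
topTweets.collect().foreach(println)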
8. 9.2.3 SchemaRDDs
Both loading data and executing queries return SchemaRDDs.
SchemaRDDs are similar to tables in a traditional database. Under
the hood, a SchemaRDD is an RDD composed of Row objects with
additional schema information of the types in each column.
Row objects are just wrappers around arrays of basic types (e.g.,
integers and strings), and we’ll cover them in more detail in the next
section.
One important note: in future versions of Spark, the name
SchemaRDD may be changed to DataFrame. This renaming was still
under discussion as this book went to print. SchemaRDDs are also
regular RDDs, so you can operate on them using existing RDD
transformations like map() and filter(). However, they provide
several additional capabilities. Most importantly, you can register
any SchemaRDD as a temporary table to query it via
HiveContext.sql or SQLContext.sql. You do so using the
SchemaRDD’s registerTempTable() method.
10. SchemaRDDs
Row objects represent records inside SchemaRDDs. In
Scala/Java, Row objects have a number of getter
functions to obtain the value of each field given its index.
Accessing the text column (also the first column) in the
topTweets SchemaRDD in Scala
val topTweetText = topTweets.map(row => row.getString(0))
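Row also offers typed getters for other positions (getInt(), getLong(), getBoolean(), and so on). A sketch, assuming the second column holds retweetCount and was inferred as an integer type:
// Pair each tweet's text with its retweet count (column indexes are zero-based)
val textAndCount = topTweets.map(row => (row.getString(0), row.getInt(1)))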
11. 9.2.4 Caching
Caching in Spark SQL works a bit differently. Since we
know the types of each column, Spark is able to more
efficiently store the data. To make sure that we cache
using the memory efficient representation, rather than
the full objects, we should use the special
hiveCtx.cacheTable("tableName") method. When
caching a table Spark SQL represents the data in an in-
memory columnar format. This cached table will remain
in memory only for the life of our driver program, so if it
exits we will need to recache our data. As with RDDs, we
cache tables when we expect to run multiple tasks or
queries against the same data.
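A minimal sketch, reusing the tweets table registered earlier:
// Cache the table in the in-memory columnar format
hiveCtx.cacheTable("tweets")
// Later queries against tweets are served from the cached columns
val counts = hiveCtx.sql("SELECT retweetCount FROM tweets")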
12. 9.3. Loading and Saving Data
Spark SQL supports a number of structured data sources out of the
box, letting you get Row objects from them without any complicated
loading process. These sources include Hive tables, JSON, and
Parquet files. In addition, if you query these sources using SQL and
select only a subset of the fields, Spark SQL can smartly scan only
the subset of the data for those fields, instead of scanning all the
data like a naive SparkContext.hadoopFile might.
Apart from these data sources, you can also convert regular RDDs in
your program to SchemaRDDs by assigning them a schema. This
makes it easy to write SQL queries even when your underlying data
is Python or Java objects. Often, SQL queries are more concise
when you’re computing many quantities at once (e.g., if you wanted
to compute the average age, max age, and count of distinct user IDs
in one pass). In addition, you can easily join these RDDs with
SchemaRDDs from any other SparkSQL data source. In this section,
we’ll cover the external sources as well as this way of using RDDs.
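As a hypothetical sketch of computing several quantities in one pass (the users table and its columns are invented for illustration):
// Average age, maximum age, and distinct user count in a single query
val stats = hiveCtx.sql(
  "SELECT AVG(age), MAX(age), COUNT(DISTINCT userId) FROM users")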
13. 9.3.1 Apache Hive
When loading data from Hive, Spark SQL supports
any Hive-supported storage formats (SerDes),
including text files, RCFiles, ORC, Parquet, Avro,
and Protocol Buffers.
Hive load in Scala
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT key, value FROM mytable")
val keys = rows.map(row => row.getInt(0))
14. 9.3.2 Parquet
Parquet is a popular column-oriented storage format that can store
records with nested fields efficiently. It is often used with tools in
the Hadoop ecosystem, and it supports all of the data types in Spark
SQL. Spark SQL provides methods for reading data directly from and
writing data to Parquet files.
Parquet load in Python
# Load some data in from a Parquet file with the fields name and favouriteAnimal
rows = hiveCtx.parquetFile(parquetFile)
names = rows.map(lambda row: row.name)
print "Everyone"
print names.collect()
Parquet query in Python
# Find the panda lovers
tbl = rows.registerTempTable("people")
pandaFriends = hiveCtx.sql("SELECT name FROM people WHERE
favouriteAnimal = \"panda\"")
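SchemaRDDs can also be written back out in Parquet format; a minimal Scala sketch, assuming a SchemaRDD named rows and a hypothetical output path:
// Save a SchemaRDD as a Parquet file (outputFile is a placeholder path)
rows.saveAsParquetFile(outputFile)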
15. 9.3.3 JSON
If you have a JSON file with records fitting the same schema,
Spark SQL can infer the schema by scanning the file and let
you access fields by name
If you have ever found yourself staring at a huge directory of
JSON records, Spark SQL’s schema inference is a very
effective way to start working with the data without writing
any special loading code.
To load our JSON data, all we need to do is call the jsonFile()
function on our hiveCtx.
Input records
{"name": "Holden"}
{"name":"Sparky The Bear", "lovesPandas":true,
"knows":{"friends": ["holden"]}}
Loading JSON with Spark SQL in Scala
val input = hiveCtx.jsonFile(inputFile)
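The inferred schema can be inspected with printSchema(); a minimal sketch:
// Print the nested schema Spark SQL inferred from the JSON records
input.printSchema()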
17. 9.3.4 From RDDs
In addition to loading data, we can also create a SchemaRDD
from an RDD. In Scala,RDDs with case classes are implicitly
converted into SchemaRDDs.
For Python we create an RDD of Row objects and then call
inferSchema().
Creating a SchemaRDD from case class in Scala
case class HappyPerson(handle: String, favouriteBeverage: String)
...
// Create a person and turn it into a Schema RDD
val happyPeopleRDD = sc.parallelize(List(HappyPerson("holden",
"coffee")))
// Note: there is an implicit conversion
// that is equivalent to sqlCtx.createSchemaRDD(happyPeopleRDD)
happyPeopleRDD.registerTempTable("happy_people")
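Once registered, the table can be queried like any other; a minimal sketch using the hiveCtx from earlier:
// Query the SchemaRDD created from the case class RDD
val beverages = hiveCtx.sql("SELECT favouriteBeverage FROM happy_people")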
18. 9.4. JDBC/ODBC Server
Spark SQL also provides JDBC connectivity
The JDBC server runs as a standalone Spark driver
program that can be shared by multiple clients. Any
client can cache tables in memory, query them, and so
on, and the cluster resources and cached data will be
shared among all of them
Launching the JDBC server
./sbin/start-thriftserver.sh --master sparkMaster
Connecting to the JDBC server with Beeline
holden@hmbp2:~/repos/spark$ ./bin/beeline -u
jdbc:hive2://localhost:10000
By default it listens on localhost:10000, but we can change these
settings with either environment variables or Hive configuration properties.
19. 9.4.1 Working with Beeline
Within the Beeline client, you can use standard
HiveQL commands to create, list, and query tables.
You can find the full details of HiveQL in the Hive
Language Manual.
Load table
CREATE TABLE IF NOT EXISTS mytable (key INT, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> LOAD DATA LOCAL INPATH 'learning-spark-examples/files/int_string.csv'
INTO TABLE mytable;
Show tables
SHOW TABLES;
mytable
Time taken: 0.052 seconds
20. 9.4.1 Working with Beeline
Beeline makes it easy to view query plans. You can
run EXPLAIN on a given query to see what the
execution plan will be
Spark SQL shell EXPLAIN
spark-sql> EXPLAIN SELECT * FROM mytable where key = 1;
== Physical Plan ==
Filter (key#16 = 1)
HiveTableScan [key#16,value#17], (MetastoreRelation default, mytable, None),
None
Time taken: 0.551 seconds
21. 9.4.2 Long-Lived Tables and Queries
One of the advantages of using Spark SQL’s JDBC
server is we can share cached tables between
multiple programs. This is possible since the JDBC
Thrift server is a single driver program. To do this,
you only need to register the table and then run the
CACHE command on it, as shown in the previous
section.
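A minimal sketch from Beeline, reusing the mytable example above (UNCACHE TABLE removes it from memory again):
CACHE TABLE mytable;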
22. 9.5. User-Defined Functions
User-defined functions, or UDFs, allow you to
register custom functions in Python, Java, and Scala
to call within SQL. They are a very popular way to
expose advanced functionality to SQL users in an
organization, so that these users can call into it
without writing code. Spark SQL makes it especially
easy to write UDFs. It supports both its own UDF
interface and existing Apache Hive UDFs.
23. 9.5.1 Spark SQL UDFs
Spark SQL offers a built-in method to easily register
UDFs by passing in a function in your programming
language. In Scala and Python, we can use the native
function and lambda syntax of the language, and in
Java we need only extend the appropriate UDF class.
Our UDFs can work on a variety of types, and we can
return a different type than the one we are called
with.
Scala string length UDF
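// registerFunction is brought into scope by import hiveCtx._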
registerFunction("strLenScala", (_: String).length)
val tweetLength = hiveCtx.sql("SELECT strLenScala('tweet')
FROM tweets LIMIT 10")
24. 9.5.2 Hive UDFs
Spark SQL can also use existing Hive UDFs. The
standard Hive UDFs are already automatically included.
If you have a custom UDF, it is important to make sure
that the JARs for your UDF are included with your
application. If we run the JDBC server, note that we can
add this with the --jars command-line flag. Developing
Hive UDFs is beyond the scope of this book, so we will
instead introduce how to use existing Hive UDFs.
Using a Hive UDF requires that we use the HiveContext
instead of a regular SQLContext. To make a Hive UDF
available, simply call hiveCtx.sql("CREATE
TEMPORARY FUNCTION name AS class.function").
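As a hypothetical sketch (the function name and implementing class are invented here; a real UDF class would ship in a JAR added with --jars):
// Register a Hive UDF under a temporary name, then call it from SQL
hiveCtx.sql("CREATE TEMPORARY FUNCTION strLenHive AS 'com.example.hive.udf.StringLengthUdf'")
val lengths = hiveCtx.sql("SELECT strLenHive(text) FROM tweets LIMIT 10")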
25. 9.6. Spark SQL Performance
As alluded to in the introduction, Spark SQL’s
higher-level query language and additional type
information allows Spark SQL to be more efficient.
Spark SQL is for more than just users who are
familiar with SQL. Spark SQL makes it very easy to
perform conditional aggregate operations, like
computing the sum of multiple columns without
having to construct special objects.
Spark SQL multiple sums
SELECT SUM(user.favouritesCount), SUM(retweetCount),
user.id FROM tweets GROUP BY user.id
26. 9.6.1 Performance Tuning Options
Using the JDBC connector, and the Beeline shell, we can set these
performance options, and other options, with the set command
Beeline command for enabling codegen
beeline> set spark.sql.codegen=true;
SET spark.sql.codegen=true
spark.sql.codegen=true
Time taken: 1.196 seconds
Scala code for enabling codegen
conf.set("spark.sql.codegen", "true")
First is spark.sql.codegen, which causes Spark SQL to compile each
query to Java bytecode before running it. Codegen can make long
queries or frequently repeated queries substantially faster.
The second option you may need to tune is
spark.sql.inMemoryColumnarStorage.batchSize. When caching
SchemaRDDs, Spark SQL groups together the records in the RDD in
batches of the size given by this option (default: 1000), and compresses
each batch.
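The batch size can be adjusted the same way as codegen; a sketch with an invented value:
// Larger batches compress better but use more memory while building each batch
conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")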
27. Edx and Coursera Courses
Introduction to Big Data with Apache Spark
Spark Fundamentals I
Functional Programming Principles in Scala