A concentrated look at Spark SQL, Apache Spark's library for structured data, including background information and numerous Scala code examples of using Spark SQL with CSV, JSON and databases such as MySQL.
2. Background
• Spark SQL is Spark's module for working with structured data.
• Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API. Usable in Java, Scala, Python and R.
• Born out of the Shark project at Berkeley
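The two query styles mentioned above can be sketched side by side. This is a minimal illustration, assuming it is pasted into spark-shell (Spark 1.4 or later in the 1.x line), where `sc` and `sqlContext` are already defined. The `Person` case class, its sample data and the `people` table name are hypothetical and not part of the original slides.

```scala
// Assumes spark-shell, where `sc` and `sqlContext` already exist.
import sqlContext.implicits._

// Hypothetical sample data, not from the slides.
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(
  Person("Ann", 34),
  Person("Bob", 19),
  Person("Cy", 52)
)).toDF()

// Style 1: SQL against a registered temporary table.
people.registerTempTable("people")
val viaSql = sqlContext.sql("SELECT name FROM people WHERE age > 30")

// Style 2: the same query expressed through the DataFrame API.
val viaApi = people.filter(people("age") > 30).select("name")

viaSql.show()
viaApi.show()
```

Both values are DataFrames, so either result can be passed on to further SQL or DataFrame operations.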
3. Assumptions
These slides and examples assume you already have at least a basic understanding of Spark constructs such as RDDs, actions and transformations.
5. Introduction
• DataFrames are distributed collections of data organized into named columns, built on top of Resilient Distributed Datasets (RDDs)
• DataFrames are composed of Row objects accompanied by a schema which describes the data type of each column.
• A DataFrame may be considered similar to a table in a traditional relational database
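The relationship between Row objects and a schema can be sketched by building a DataFrame by hand. This is a minimal illustration, assuming spark-shell on Spark 1.3 or later in the 1.x line, where `sc` and `sqlContext` are predefined. The column names and sample values are hypothetical.

```scala
// Assumes spark-shell, where `sc` and `sqlContext` already exist.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical rows: each Row is an untyped sequence of column values.
val rows = sc.parallelize(Seq(Row("Emma", 2013), Row("Liam", 2014)))

// The schema names each column and declares its data type.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("year", IntegerType, nullable = false)
))

// Pair the rows with the schema to obtain a DataFrame.
val df = sqlContext.createDataFrame(rows, schema)
df.printSchema() // lists each column with its type
df.show()        // renders the rows as a table
```

When the data comes from CSV or JSON, as in the slides that follow, the schema is typically inferred rather than written out like this.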
6. 1. $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
2. scala> val baby_names = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("baby_names.csv")
3. scala> baby_names.registerTempTable("names")
4. scala> val distinctYears = sqlContext.sql("select distinct Year from names")
5. scala> distinctYears.collect.foreach(println)
Spark SQL with CSV
7. JSON used in the following examples (one object per line in the file):
{"first_name":"James", "last_name":"Butterburg", "address": {"street": "6649 N Blue Gum St", "city": "New Orleans","state": "LA", "zip": "70116" }}
{"first_name":"Josephine", "last_name":"Darakjy", "address": {"street": "4 B Blue Ridge Blvd", "city": "Brighton","state": "MI", "zip": "48116" }}
{"first_name":"Art", "last_name":"Chemel", "address": {"street": "8 W Cerritos Ave #54", "city": "Bridgeport","state": "NJ", "zip": "08014" }}
Spark SQL with JSON (slide 1 of 2)
8. 1. $SPARK_HOME/bin/spark-shell
2. scala> val customers = sqlContext.read.json("customers.json")
3. scala> customers.registerTempTable("customers")
4. scala> val firstCityState = sqlContext.sql("SELECT first_name, address.city, address.state FROM customers")
Spark SQL with JSON (slide 2 of 2)
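The nested-field query above can also be written with the DataFrame API. The following is a minimal sketch, assuming spark-shell on Spark 1.4 or later in the 1.x line, where `sc` and `sqlContext` are predefined. To stay self-contained it reads two of the sample records from an in-memory RDD of strings rather than from customers.json; the `json` value name is an assumption made for illustration.

```scala
// Assumes spark-shell, where `sc` and `sqlContext` already exist.
// One JSON object per string, mirroring one object per line in a file.
val json = sc.parallelize(Seq(
  """{"first_name":"James", "last_name":"Butterburg", "address": {"street": "6649 N Blue Gum St", "city": "New Orleans", "state": "LA", "zip": "70116"}}""",
  """{"first_name":"Art", "last_name":"Chemel", "address": {"street": "8 W Cerritos Ave #54", "city": "Bridgeport", "state": "NJ", "zip": "08014"}}"""
))

// Schema inference turns the nested address object into a struct column.
val customers = sqlContext.read.json(json)
customers.printSchema()

// Dot notation reaches into the struct, just as in the SQL version.
val firstCityState = customers.select("first_name", "address.city", "address.state")
firstCityState.show()
```

Calling `printSchema()` first is a quick way to confirm which nested field names are available before querying them.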