A technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, relational databases, or structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out common design patterns. The second half of the talk will focus on the technical implementation of DataFrames, such as the use of Spark SQL’s Catalyst optimizer to intelligently plan user programs, and the use of fast binary data structures in Spark’s core engine to substantially improve performance and memory use for common types of operations.
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structured Data
1. Spark DataFrames:
Simple and Fast Analytics
on Structured Data
Michael Armbrust
Spark Summit Amsterdam 2015 - October, 28th
2. Graduated
from Alpha
in 1.3
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
About Me and Spark SQL
[Charts: # of commits per month and # of contributors]
2
3. 3
SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
About Me and Spark SQL
Improved
multi-version
support in 1.4
4. 4
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
About Me and Spark SQL
5. • Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, Java, and R
5
About Me and Spark SQL
6. • Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, Java, and R
• @michaelarmbrust
• Creator of Spark SQL @databricks
6
About Me and Spark SQL
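To make the bullets above concrete, here is a minimal PySpark sketch of running the HiveQL query shown on slide 3 from Python; it assumes a Hive table named hiveTable and a registered hive_udf already exist.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hiveql-example")
sqlContext = HiveContext(sc)  # HiveContext adds HiveQL support and access to Hive tables

# The query from slide 3, run directly from Python; the result comes back as a DataFrame.
counts = sqlContext.sql("SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)")
counts.show()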
8. 8
Create and run
Spark programs faster:
SQL
• Write less code
• Read less data
• Let the optimizer do the hard work
9. DataFrame
noun – [dey-tuh-freym]
9
1. A distributed collection of rows organized into
named columns.
2. An abstraction for selecting, filtering, aggregating
and plotting structured data (cf. R, Pandas).
3. Archaic: Previously SchemaRDD (cf. Spark < 1.3).
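A minimal PySpark sketch of the definition above: a DataFrame is just rows organized into named columns. The data and column names are invented for illustration, and an existing SparkContext named sc is assumed.

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
    [("Alice", 34), ("Bob", 29)],   # rows
    ["name", "age"])                # named columns
df.printSchema()
df.show()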
10. Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read
.format("json")
.option("samplingRatio", "0.1")
.load("/home/michael/data.json")
df.write
.format("parquet")
.mode("append")
.partitionBy("year")
.saveAsTable("fasterData")
10
11. Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read
.format("json")
.option("samplingRatio", "0.1")
.load("/home/michael/data.json")
df.write
.format("parquet")
.mode("append")
.partitionBy("year")
.saveAsTable("fasterData")
read and write
functions create
new builders for
doing I/O
11
12. Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
Builder methods
specify:
• Format
• Partitioning
• Handling of
existing data
df = sqlContext.read
.format("json")
.option("samplingRatio", "0.1")
.load("/home/michael/data.json")
df.write
.format("parquet")
.mode("append")
.partitionBy("year")
.saveAsTable("fasterData")
12
13. Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
load(…), save(…) or
saveAsTable(…)
finish the I/O
specification
df = sqlContext.read
.format("json")
.option("samplingRatio", "0.1")
.load("/home/michael/data.json")
df.write
.format("parquet")
.mode("append")
.partitionBy("year")
.saveAsTable("fasterData")
13
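The same round trip can also be written with the shortcut reader/writer methods (Spark 1.4+). A minimal sketch with hypothetical paths; it is equivalent to the builder form above except that it writes Parquet files to a path rather than a metastore table.

df = sqlContext.read.json("/home/michael/data.json")   # same as .format("json").load(...)
(df.write
   .mode("append")
   .partitionBy("year")
   .parquet("/data/fasterData"))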
14. Read Less Data: Efficient Formats
Parquet and ORC are efficient columnar storage formats:
• Compact binary encoding with intelligent compression
(delta, RLE, etc)
• Each column stored separately with an index that allows
skipping of unread columns
• Support for partitioning (/data/year=2015)
• Data skipping using statistics (column min/max, etc)
14
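A minimal sketch of how partitioning and column pruning pay off with Parquet; the paths and column names are invented for illustration.

# Write the data partitioned by year: /data/events_by_year/year=2015/...
df.write.partitionBy("year").parquet("/data/events_by_year")

# Only the year=2015 directories are scanned (partition pruning), and only the
# requested columns are read from the columnar files (column pruning).
events_2015 = (sqlContext.read.parquet("/data/events_by_year")
               .where("year = 2015")
               .select("user_id", "timestamp"))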
15. Write Less Code: Data Source API
Spark SQL’s Data Source API can read and write DataFrames
using a variety of formats.
15
Built-In: { JSON }, JDBC, ORC, plain text*, and more…
External: find more sources at http://spark-packages.org/
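For example, the built-in JDBC source reads a database table as a DataFrame (Spark 1.4+). A minimal sketch; the URL, table name, and credentials are placeholders, and the JDBC driver jar must be on the classpath.

people = (sqlContext.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")
          .option("dbtable", "people")
          .option("user", "spark")
          .option("password", "*******")
          .load())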
16. ETL Using Custom Data Sources
sqlContext.read
.format("com.databricks.spark.jira")
.option("url", "https://issues.apache.org/jira/rest/api/latest/search")
.option("user", "marmbrus")
.option("password", "*******")
.option("query", """
|project = SPARK AND
|component = SQL AND
|(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
.load()
.repartition(1)
.write
.format("parquet")
.saveAsTable("sparkSqlJira")
Load data from JIRA (Spark’s Bug Tracker) using a custom data source. Write the converted data out to a table stored in Parquet.
16
17. Write Less Code: High-Level Operations
Solve common problems concisely using DataFrame functions (see the sketch below):
• Selecting columns and filtering
• Joining different data sources
• Aggregation (count, sum, average, etc)
• Plotting results with Pandas
17
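A minimal sketch touching each of the bullets above; the table names, columns, and plotting call are invented for illustration (plotting via toPandas() requires matplotlib).

from pyspark.sql.functions import avg

users = sqlCtx.table("users")
events = sqlCtx.table("events")

stats = (events
         .select("user_id", "duration")                  # select columns
         .where(events.duration > 0)                     # filter
         .join(users, events.user_id == users.user_id)   # join data sources
         .groupBy(users.city)                            # aggregate ...
         .agg(avg("duration").alias("avg_duration")))    # ... count / sum / average

stats.toPandas().plot(x="city", y="avg_duration", kind="bar")  # plot with Pandas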
18. Write Less Code: Compute an Average
private IntWritable one =
new IntWritable(1)
private IntWritable output =
new IntWritable()
protected void map(
LongWritable key,
Text value,
Context context) {
String[] fields = value.toString().split("\t")
output.set(Integer.parseInt(fields[1]))
context.write(one, output)
}
IntWritable one = new IntWritable(1)
DoubleWritable average = new DoubleWritable()
protected void reduce(
IntWritable key,
Iterable<IntWritable> values,
Context context) {
int sum = 0
int count = 0
for(IntWritable value : values) {
sum += value.get()
count++
}
average.set(sum / (double) count)
context.write(key, average)
}
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1]))
.reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
.map(lambda x: [x[0], x[1][0] / x[1][1]])
.collect()
18
19. Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1]))
.reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
.map(lambda x: [x[0], x[1][0] / x[1][1]])
.collect()
Using DataFrames
sqlCtx.table("people")
.groupBy("name")
.agg(avg("age"))
.map(lambda …)
.collect()
Full API Docs
• Python
• Scala
• Java
• R
19
Using SQL
SELECT name, avg(age)
FROM people
GROUP BY name
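The SQL variant assumes the data has been registered as a table first; a minimal sketch using the Spark 1.x API (the path is hypothetical).

df = sqlCtx.read.parquet("/data/people")
df.registerTempTable("people")
averages = sqlCtx.sql("SELECT name, avg(age) FROM people GROUP BY name")
averages.show()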
20. Not Just Less Code, Faster Too!
20
[Bar chart: time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
21. Plan Optimization & Execution
21
[Diagram: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Logical Optimization produces an Optimized Logical Plan; Physical Planning produces candidate Physical Plans; a Cost Model selects the Physical Plan; Code Generation turns it into RDDs]
DataFrames and SQL share the same optimization/execution pipeline
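To see what this pipeline produces for a given query, explain() prints the plans; the extended form shows the parsed, analyzed, optimized, and physical plans. A minimal sketch reusing the people table from slide 19.

df = sqlCtx.table("people").groupBy("name").agg({"age": "avg"})
df.explain(True)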
22. Seamlessly Integrated
Intermix DataFrame operations with
custom Python, Java, R, or Scala code
zipToCity = udf(lambda zipCode: <custom logic here>)
def add_demographics(events):
u = sqlCtx.table("users")
return (events
.join(u, events.user_id == u.user_id)
.withColumn("city", zipToCity(events.zip)))
Augments any
DataFrame
that contains
user_id
22
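One hypothetical way to fill in the <custom logic here> placeholder above; the zip-to-city lookup table is invented for illustration.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

city_by_zip = {"94105": "San Francisco", "1012": "Amsterdam"}
zipToCity = udf(lambda zipCode: city_by_zip.get(zipCode, "unknown"), StringType())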
23. Optimize Full Pipelines
Optimization happens as late as possible, therefore
Spark SQL can optimize even across functions.
23
events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = (events
.where(events.city == "Amsterdam")
.select(events.timestamp)
.collect())
24. 24
def add_demographics(events):
u = sqlCtx.table("users") # Load Hive table
return (events
.join(u, events.user_id == u.user_id) # Join on user_id
.withColumn("city", zipToCity(events.zip))) # Run udf to add city column
events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()
[Diagram: Logical Plan: filter by city on top of join(events file, users table); Physical Plan: filter by city over join(scan(events), scan(users)). The join is expensive; ideally only the relevant users would be joined]
24
25. 25
def add_demographics(events):
u = sqlCtx.table("users") # Load partitioned Hive table
return (events
.join(u, events.user_id == u.user_id) # Join on user_id
.withColumn("city", zipToCity(events.zip))) # Run udf to add city column
[Diagram: Optimized Physical Plan with Predicate Pushdown and Column Pruning: join(optimized scan (events), optimized scan (users))]
events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()
[Diagram: Logical Plan: filter by city on top of join(events file, users table); Physical Plan: filter by city over join(scan(events), scan(users))]
25
26. Machine Learning Pipelines
26
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)
[Diagram: the Pipeline transforms ds0 → ds1 → ds2 → ds3 via tokenizer, hashingTF, and lr; fitting produces lr.model inside a Pipeline Model]
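A short sketch of using the fitted pipeline; test_df is a hypothetical DataFrame with the same "text" column the pipeline was trained on.

test_df = sqlCtx.read.json("/path/to/test_data")
predictions = model.transform(test_df)              # adds words, features, and prediction columns
predictions.select("text", "prediction").show()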
27. • 100+ native functions with
optimized codegen
implementations
– String manipulation – concat,
format_string, lower, lpad
– Date/Time – current_timestamp,
date_format, date_add, …
– Math – sqrt, randn, …
– Other –
monotonicallyIncreasingId,
sparkPartitionId, …
27
Rich Function Library
from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)
import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)
Added in Spark 1.5
32. • Type-safe: operate on domain
objects with compiled lambda
functions
• Fast: Code-generated
encoders for fast serialization
• Interoperable: Easily convert
DataFrame ↔ Dataset
without boilerplate
32
Coming soon: Datasets
val df = ctx.read.json("people.json")
// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)
// Compute histogram of age by name.
val hist = ds.groupBy(_.name).mapGroups {
case (name, people: Iterator[Person]) =>
val buckets = new Array[Int](10)
people.map(_.age).foreach { a =>
buckets(a / 10) += 1
}
(name, buckets)
}
Preview in Spark 1.6
33. 33
Create and run your
Spark programs faster:
SQL
• Write less code
• Read less data
• Let the optimizer do the hard work
Questions?
Committer Office Hours
Weds, 4:00-5:00 pm Michael
Thurs, 10:30-11:30 am Reynold
Thurs, 2:00-3:00 pm Andrew