Continuing the objectives of making Spark faster, easier, and smarter, Apache Spark 3.1 extends its scope with more than 1,500 resolved JIRAs. We will talk about the exciting new developments in Apache Spark 3.1 as well as other major initiatives coming in the future. In this talk, we want to share with the community many of the most important changes, with examples and demos.
The following features are covered: SQL features for ANSI SQL compliance, new streaming features, Python usability improvements, and performance enhancements along with new tuning tricks in the query compiler.
3. About Us
Xiao Li (Github: gatorsmile)
Wenchen Fan (Github: cloud-fan)
• The Spark team at Databricks
• Apache Spark Committers and PMC members
4. Apache Spark 3.1
[Word cloud of Spark 3.1 highlights, grouped by theme:]
• ANSI Compliance: Create Table Syntax, Runtime Error, Explicit Cast, Char/Varchar
• Performance: Sub-expression Elimination, Partition Pruning, Predicate Pushdown, Shuffle Hash Join, Shuffle Removal, Nested Field Pruning
• Execution: Node Decommissioning, Spark on K8S GA, Stage-level Scheduling
• Streaming: Stream-stream Join, History Server Support for SS, Streaming Table APIs, State Schema Validation
• Python: New Doc for PySpark, Python Type Support, Dependency Management, Installation Option for PyPI
• More: Catalog APIs for JDBC, Ignore Hints, Search Function in Spark Doc
9. ANSI Mode GA in Spark 3.2
• ANSI implicit CAST
• More runtime failures for invalid input
• No-failure alternatives: TRY_CAST, TRY_DIVIDE, TRY_ADD, …
• ...
In Development
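A minimal sketch of the intended behavior (TRY_CAST and the other TRY_* functions landed in Spark 3.2; semantics may still evolve):

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT TRY_CAST('abc' AS INT)").show()  # NULL, no failure
spark.sql("SELECT CAST('abc' AS INT)").show()      # raises a runtime error under ANSI mode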
10. Unified CREATE TABLE SQL Syntax
Apache Spark 3.1
CREATE TABLE t (col INT) USING parquet OPTIONS …
CREATE TABLE t (col INT) STORED AS parquet SERDEPROPERTIES …

-- :( This creates a Hive text table, super slow!
CREATE TABLE t (col INT)

-- After Spark 3.1:
SET spark.sql.legacy.createHiveTableByDefault=false;
-- :) Now this creates a native Parquet table
CREATE TABLE t (col INT)
12. CHAR/VARCHAR Support
Apache Spark 3.1
CHAR type values are padded to the declared length.
Provided mostly for ANSI compatibility; we recommend using VARCHAR instead.
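A runnable sketch of the padding behavior (the table name is illustrative):

# CHAR values are space-padded to the declared length; VARCHAR values are not
spark.sql("CREATE TABLE char_demo (c CHAR(5), v VARCHAR(5)) USING parquet")
spark.sql("INSERT INTO char_demo VALUES ('ab', 'ab')")
spark.sql("SELECT length(c), length(v) FROM char_demo").show()  # 5 (padded) and 2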
13. More ANSI Features
Apache Spark 3.1
• Unify SQL temp view and permanent view behaviors (SPARK-33138)
• Re-parse and analyze the view SQL string when reading the view
• Support column list in INSERT statement (SPARK-32976)
• INSERT INTO t(col2, col1) VALUES …
• Support ANSI nested bracketed comments (SPARK-28880)
• ...
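For example, a sketch of the new column-list support (table and values are illustrative):

# Columns may be listed in any order; values are matched to the listed columns
spark.sql("CREATE TABLE ins_demo (col1 INT, col2 STRING) USING parquet")
spark.sql("INSERT INTO ins_demo (col2, col1) VALUES ('a', 1)")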
14. More ANSI Features Coming in Spark 3.2!
• ANSI year-month and day-time INTERVAL data types
• Comparable and persistable
• ANSI TIMESTAMP WITHOUT TIME ZONE type
• Simplifies timestamp handling
• A new decorrelation framework for correlated subqueries
• Supports outer references in more places
• LATERAL JOIN
• FROM t1 JOIN LATERAL (SELECT t1.col + t2.col FROM t2)
• SQL error codes
• More searchable, cross-language, JDBC compatible
In Development
16. Node Decommissioning
Apache Spark 3.1
Gracefully handle scheduled executor shutdown:
• Auto-scaling: Spark decides to shut down one or more idle executors.
• EC2 spot instances: the executor gets notified when it is going to be killed soon.
• GCE preemptible instances: same as above.
• YARN & Kubernetes: containers can be killed, with notification, to make room for higher-priority tasks.
17-19. Node Decommissioning
Apache Spark 3.1
Migrate RDD cache and shuffle blocks from executors that are about to shut down to other live executors.
[Diagram: a driver and executors 1-3, built up over three slides:]
1. A shutdown trigger sends a signal to notify executor 1 of the upcoming shutdown.
2. Executor 1 notifies the driver and migrates its data to executors 2 and 3.
3. The driver stops scheduling tasks on executor 1.
20. Summary
Apache Spark 3.1
• Migrate data blocks to other nodes before shutdown, to avoid recomputing them later.
• Stop scheduling tasks on the decommissioning node, since they likely can't complete and would waste resources.
• Launch speculative copies of tasks already running on the decommissioning node, since those likely can't complete either.
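A minimal sketch of the configs that enable this (names from the Spark 3.1 configuration docs; verify defaults for your deployment):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.decommission.enabled", "true")                        # graceful decommissioning
    .config("spark.storage.decommission.enabled", "true")                # migrate blocks on shutdown
    .config("spark.storage.decommission.rddBlocks.enabled", "true")      # include cached RDD blocks
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")  # include shuffle blocks
    .getOrCreate())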
22. Shuffle Hash Join Improvement
Apache Spark 3.1
Spark prefers Sort Merge Join over Shuffle Hash Join to avoid OOM.
[Diagram: both the build side and the probe side are partitioned by the join keys (Partition 0, 1, 2, …); for each partition, the build side's rows are loaded into an in-memory hash table, and the probe side's rows look it up to join.]
23. Shuffle Hash Join Improvement
Apache Spark 3.1
Makes Shuffle Hash Join on par with Sort Merge Join and Broadcast Hash Join:
• Add code-gen for shuffled hash join (SPARK-32421)
• Support full outer join in shuffled hash join (SPARK-32399)
• Add handling for unique key in non-codegen hash join (SPARK-32420)
• Preserve shuffled hash join build side partitioning (SPARK-32330)
• ...
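A sketch of opting in via the join hint (available since Spark 3.0; the DataFrames are illustrative):

df1 = spark.range(100).withColumnRenamed("id", "key")
df2 = spark.range(50).withColumnRenamed("id", "key")
# With SPARK-32399, a shuffled hash join can now serve full outer joins too
joined = df1.join(df2.hint("SHUFFLE_HASH"), "key", "full_outer")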
24-25. Partition Pruning Improvement
Apache Spark 3.1
Partition pruning is critical for file scan performance.
[Diagram: file scan operator and catalog:]
1. The file scan operator sends partition predicates (pushed filters) to the catalog.
2. The catalog returns the pruned file list.
3. Spark launches tasks (task1, task2, task3, …) to read only the remaining files.
26. Partition Pruning Improvement
Apache Spark 3.1
Push down more partition predicates:
• Support Contains, StartsWith and EndsWith in partition pruning (SPARK-33458)
• Support date type in partition pruning (SPARK-33477)
• Support not-equals in partition pruning (SPARK-33582)
• Support NOT IN in partition pruning (SPARK-34538)
• ...
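For example (a sketch; 'events' is an illustrative table partitioned by a string column 'dt'):

# Predicates like these can now be pushed to the catalog for partition pruning
spark.sql("SELECT * FROM events WHERE dt LIKE '2021-01%'")  # StartsWith
spark.sql("SELECT * FROM events WHERE dt != '2021-06-01'")  # not-equals
spark.sql("SELECT * FROM events WHERE dt NOT IN ('2021-06-01', '2021-06-02')")  # NOT IN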
27. Predicate Pushdown Improvement
Apache Spark 3.1
Join conditions mixed with columns from both sides:
• Can push: FROM t1 JOIN t2 ON t1.key = 1 AND t2.key = 2
• Cannot push: FROM t1 JOIN t2 ON t1.key = 1 OR t2.key = 2
• Cannot push: FROM t1 JOIN t2 ON (t1.key = 1 AND t2.key = 2) OR t1.key = 3
Predicates mixing data columns and partition columns have similar issues.
28. Predicate Pushdown Improvement
Apache Spark 3.1
Conjunctive Normal Form (CNF):
(a1 AND a2) OR (b1 AND b2) ->
(a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2)
Each conjunct that references only one side can then be pushed down: more predicates pushed, less disk IO.
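Applied to the "cannot push" example from the previous slide (a sketch of the rewrite, not the literal optimizer output):

(t1.key = 1 AND t2.key = 2) OR t1.key = 3
-> (t1.key = 1 OR t1.key = 3) AND (t2.key = 2 OR t1.key = 3)

The first conjunct references only t1, so it can now be pushed below the join.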
29. Reduce Query Compiling Latency (3.2)
In Development
Optimize for short queries
A major improvement of the Catalyst framework
Stay tuned!
31. Structured Streaming
> 120 trillion records/day processed on Databricks with Structured Streaming
Growth on Databricks with Structured Streaming: 2.5x over the past year, 12x since two years ago
New: Stream-stream Join, History Server Support for SS, Streaming Table APIs, RocksDB State Store (3.2)
32. Streaming Table APIs
The APIs to read/write continuous data streams as unbounded tables
Apache Spark 3.1
input = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load())

(input.writeStream
    .option("checkpointLocation", "path/to/checkpoint/dir1")
    .format("delta")
    .toTable("myStreamTable"))
33. Streaming Table APIs
The APIs to read/write continuous data streams as unbounded tables
Apache Spark 3.1
(spark.readStream
    .table("myStreamTable")
    .select("value")
    .writeStream
    .option("checkpointLocation", "path/to/checkpoint/dir2")
    .format("delta")
    .toTable("newStreamTable"))

# Show the current snapshot
spark.read.table("myStreamTable").show(truncate=False)
36. State Store for Structured Streaming
State-store-backed operations. Examples:
• Stateful aggregations
• Drop duplicates
• Stream-stream joins
Maintains intermediate state data across batches, with:
• High availability
• Fault tolerance
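A minimal sketch of a state-store-backed query (the rate source is illustrative):

# Per-key counts live in the state store and are carried across micro-batches
counts = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
    .groupBy("value")
    .count())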
37. State Store for Structured Streaming
State store metrics are now shown in the live UI and the Spark History Server.
Apache Spark 3.1
38. State Store
HDFS-backed State Store (the default in Spark 3.1):
• Each executor has hash map(s) containing versioned data
• Updates are committed as delta files in HDFS
• Delta files are periodically collapsed into snapshots to improve recovery
Drawbacks:
• The amount of state that can be maintained is limited by the executors' heap size.
• State expiration by watermark and/or timeouts requires full scans over all the data.
[Diagram: a Spark cluster where each executor keeps in-memory hash maps of key-value pairs (k1 -> v1, k2 -> v2) and commits delta files to HDFS.]
39. RocksDB State Store
In Development
RocksDB state store (the default on Databricks):
• RocksDB can serve data from disk with a configurable amount of non-JVM memory.
• Sorting keys by the appropriate column should avoid full scans when finding the keys to drop.
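A sketch of switching providers (the class name as it later landed in Spark 3.2):

spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")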
42. Add type hints [PEP 484] to PySpark!
§ Simple to use
▪ Autocompletion in IDEs/notebooks. For example, in a Databricks notebook:
▪ Display Python docstring hints by pressing Shift+Tab
▪ Display a list of valid completions by pressing Tab
Apache Spark 3.1
43. Add type hints [PEP 484] to PySpark!
§ Simple to use
▪ Autocompletion in IDEs/notebooks
§ Richer API documentation
▪ Automatic inclusion of input and output types
§ Better code quality
▪ Static analysis in IDEs, error detection, type-mismatch checks, etc.
▪ Running mypy on CI systems
Apache Spark 3.1
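A sketch of what the hints enable (the function and column names are illustrative):

from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

def with_source(df: DataFrame, source: str) -> DataFrame:
    # With PySpark 3.1's inline type hints, mypy and IDEs can verify these annotations
    return df.withColumn("source", lit(source))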
45. Python Dependency Management
Supported environments:
• Conda
• Venv (Virtualenv)
• PEX
Apache Spark 3.1
[Diagram: the driver ships the archived (tar) environment to each worker.]
Blog post: How to Manage Python Dependencies in PySpark — http://tinyurl.com/pysparkdep
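A sketch following the approach in the blog post (archive name and paths are illustrative; assumes a conda environment packed beforehand, e.g. with conda-pack):

import os
from pyspark.sql import SparkSession

os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"  # Python inside the unpacked archive
spark = (SparkSession.builder
    .config("spark.archives", "pyspark_env.tar.gz#environment")  # shipped and unpacked on workers
    .getOrCreate())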
46. Koalas
Announced April 24, 2019. A pure Python library, familiar if you are coming from pandas.
§ Aims at providing the pandas API on top of Spark
§ Unifies the two ecosystems with a familiar API
§ Seamless transition between small and large data
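A minimal sketch (pip install koalas; the data is illustrative):

import databricks.koalas as ks

kdf = ks.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
print(kdf.describe())  # pandas-style API, executed on Spark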
47. Koalas Growth (from 0 to 84 Countries)
> 2 million imports per month on Databricks
~ 3 million PyPI downloads per month
55. New Utility Functions for Unix Time
Apache Spark 3.1
timestamp_seconds: Creates a timestamp from the number of seconds (can be fractional) since the UTC epoch.
> SELECT timestamp_seconds(1230219000);
2008-12-25 07:30:00
unix_seconds: Returns the number of seconds since 1970-01-01 00:00:00 UTC (the UTC epoch).
> SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));
1
date_from_unix_date: Creates a date from the number of days since 1970-01-01.
> SELECT date_from_unix_date(1);
1970-01-02
unix_date: Returns the number of days since 1970-01-01.
> SELECT unix_date(DATE('1970-01-02'));
1
Also: timestamp_millis, timestamp_micros, unix_millis and unix_micros.
56. Unix Time Functions Compared (Name / Version / Description / Example)
Name: unix_seconds, unix_micros, unix_millis — Version: 3.1
Description: Returns the number of seconds/microseconds/milliseconds since 1970-01-01 00:00:00 UTC.
-- input: Timestamp => output: Long
> SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));
1

Name: to_unix_timestamp — Version: 1.6; unix_timestamp — Version: 1.5 (not recommended unless you want the current timestamp)
Description: Returns the number of seconds since 1970-01-01 00:00:00 UTC.
-- input: String => output: Long
> SELECT to_unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460098800
> SELECT unix_timestamp(); -- returns the current timestamp
1460041200
> SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460041200

Name: timestamp_seconds — Version: 3.1
Description: Creates a timestamp from the number of seconds (can be fractional) since 1970-01-01 00:00:00 UTC.
-- input: Long => output: Timestamp
> SELECT timestamp_seconds(1230219000);
2008-12-25 07:30:00

Name: from_unixtime — Version: 1.5
Description: Returns a timestamp string in the specified format, converted from the input number of seconds elapsed since 1970-01-01 00:00:00 UTC.
-- input: Long => output: String
> SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
1969-12-31 16:00:00
57. New Utility Functions for Time Zone
current_timezone: Returns the current session local timezone.
> SELECT current_timezone();
Asia/Shanghai
SET TIME ZONE: Sets the local time zone of the current session.
-- Set time zone to the system default.
SET TIME ZONE LOCAL;
-- Set time zone to a region-based zone ID.
SET TIME ZONE 'America/Los_Angeles';
-- Set time zone to a zone offset.
SET TIME ZONE '+08:00';
Apache Spark 3.1
Upcoming new data type: Timestamp Without Time Zone
58. EXPLAIN FORMATTED
§ The Spark UI uses EXPLAIN FORMATTED by default
§ New format for Adaptive Query Execution
Apache Spark 3.1
[Screenshot callouts: the final plan after adaptive optimization; the plan after cost-based optimization]
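A quick sketch (the query is illustrative):

# mode="formatted" prints a condensed plan outline followed by per-node details
spark.sql("SELECT id, count(*) FROM range(10) GROUP BY id").explain(mode="formatted")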
66. Installation Option for PyPI
Apache Spark 3.1
The default bundled Hadoop version is changed to Hadoop 3.2.
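A sketch of the new installation options (PYSPARK_HADOOP_VERSION is documented in the Spark 3.1 installation guide):

pip install pyspark                              # bundled Hadoop 3.2 (the new default)
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark   # pick a different bundled Hadoop version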
67. Deprecations and Removals
Apache Spark 3.1
§ Drop Python 2.7, 3.4 and 3.5 (SPARK-32138)
§ Drop R < 3.5 support (SPARK-32073)
§ Remove the hive-1.2 distribution (SPARK-32981)
In the upcoming Spark 3.2:
§ Deprecate Mesos support (SPARK-35050)
68. Apache Spark 3.1
[Word cloud recap of the Spark 3.1 features from the overview slide: ANSI compliance, performance, execution, streaming, Python, and more.]
69. In Development
[Word cloud of upcoming features, grouped by theme:]
• ANSI SQL Compliance: ANSI Mode GA, Error Code, Implicit Type Cast, Interval Type, Timestamp w/o Time Zone, Lateral Join, Decorrelation Framework
• Performance: Adaptive Optimization, Compile Latency Reduction, Push-based Shuffle, Low-latency Scheduler, Parquet 1.12 (Column Index)
• Streaming: RocksDB State Store, Queryable State Store, Session Window, State Store APIs
• Python: Pythonic Error Handling, Richer Input/Output, pandas APIs, Visualization and Plotting
• More: Scala 2.13 Beta, Java 17, Complex Type Support in ORC, DML Metrics