Continuing the objectives of making Spark faster, easier, and smarter, Apache Spark 3.1 extends its scope with more than 1,500 resolved JIRAs. We will talk about the exciting new developments in Apache Spark 3.1 as well as other major initiatives coming in the future. In this talk, we want to share with the community many of the most important changes, with examples and demos.
The following features are covered: SQL features for ANSI SQL compliance, new streaming features, Python usability improvements, and performance enhancements along with new tuning tricks in the query compiler.
3. About Us
Xiao Li (Github: gatorsmile)
Wenchen Fan (Github: cloud-fan)
• The Spark team at Databricks
• Apache Spark Committers and PMC members
4. Apache Spark 3.1
[Word cloud of Spark 3.1 highlights, grouped by theme:]
• ANSI Compliance: Create Table Syntax, Runtime Error, Explicit Cast, Char/Varchar
• Performance: Sub-expression Elimination, Partition Pruning, Predicate Pushdown, Shuffle Hash Join, Shuffle Removal, Nested Field Pruning
• Execution: Node Decommissioning, Spark on K8S GA, Stage-level Scheduling
• Streaming: Stream-stream Join, History Server Support for SS, Streaming Table APIs, State Schema Validation
• Python: New Doc for PySpark, Python Type Support, Dependency Management, Installation Option for PyPI
• More: Catalog APIs for JDBC, Ignore Hints, Search Function in Spark Doc
9. ANSI Mode GA in Spark 3.2
• ANSI implicit CAST
• More runtime failures for invalid input
• No-failure alternatives: TRY_CAST, TRY_DIVIDE, TRY_ADD, …
• ...
In Development
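A minimal sketch of the intended behavior (TRY_CAST and the other TRY_* functions landed in Spark 3.2; semantics may still evolve):

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT TRY_CAST('abc' AS INT)").show()  # NULL, no failure
spark.sql("SELECT CAST('abc' AS INT)").show()      # raises a runtime error under ANSI mode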
10. Unified CREATE TABLE SQL Syntax
Apache Spark 3.1
CREATE TABLE t (col INT) USING parquet OPTIONS …
CREATE TABLE t (col INT) STORED AS parquet SERDEPROPERTIES …

-- :( This creates a Hive text table, super slow!
CREATE TABLE t (col INT)

-- After Spark 3.1:
SET spark.sql.legacy.createHiveTableByDefault=false;
-- :) Now this creates a native Parquet table
CREATE TABLE t (col INT)
12. CHAR/VARCHAR Support
Apache Spark 3.1
CHAR type values are padded to the declared length.
Provided mostly for ANSI compatibility; we recommend using VARCHAR instead.
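A runnable sketch of the padding behavior (the table name is illustrative):

# CHAR values are space-padded to the declared length; VARCHAR values are not
spark.sql("CREATE TABLE char_demo (c CHAR(5), v VARCHAR(5)) USING parquet")
spark.sql("INSERT INTO char_demo VALUES ('ab', 'ab')")
spark.sql("SELECT length(c), length(v) FROM char_demo").show()  # 5 (padded) and 2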
13. More ANSI Features
Apache Spark 3.1
• Unify SQL temp view and permanent view behaviors (SPARK-33138)
• Re-parse and analyze the view SQL string when reading the view
• Support column list in INSERT statement (SPARK-32976)
• INSERT INTO t(col2, col1) VALUES …
• Support ANSI nested bracketed comments (SPARK-28880)
• ...
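For example, a sketch of the new column-list support (table and values are illustrative):

# Columns may be listed in any order; values are matched to the listed columns
spark.sql("CREATE TABLE ins_demo (col1 INT, col2 STRING) USING parquet")
spark.sql("INSERT INTO ins_demo (col2, col1) VALUES ('a', 1)")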
14. More ANSI Features Coming in Spark 3.2!
• ANSI year-month and day-time INTERVAL data types
• Comparable and persistable
• ANSI TIMESTAMP WITHOUT TIME ZONE type
• Simplifies timestamp handling
• A new decorrelation framework for correlated subqueries
• Supports outer references in more places
• LATERAL JOIN
• FROM t1 JOIN LATERAL (SELECT t1.col + t2.col FROM t2)
• SQL error codes
• More searchable, cross-language, JDBC compatible
In Development
16. Node Decommissioning
Apache Spark 3.1
Gracefully handle scheduled executor shutdown:
• Auto-scaling: Spark decides to shut down one or more idle executors.
• EC2 spot instances: the executor gets notified when it is going to be killed soon.
• GCE preemptible instances: same as above.
• YARN & Kubernetes: containers can be killed, with notification, to make room for higher-priority tasks.
17-19. Node Decommissioning
Apache Spark 3.1
Migrate RDD cache and shuffle blocks from executors that are about to shut down to other live executors.
[Diagram: a driver and executors 1-3, built up over three slides:]
1. A shutdown trigger sends a signal to notify executor 1 of the upcoming shutdown.
2. Executor 1 notifies the driver and migrates its data to executors 2 and 3.
3. The driver stops scheduling tasks on executor 1.
20. Summary
Apache Spark 3.1
• Migrate data blocks to other nodes before shutdown, to avoid recomputing them later.
• Stop scheduling tasks on the decommissioning node, since they likely can't complete and would waste resources.
• Launch speculative copies of tasks already running on the decommissioning node, since those likely can't complete either.
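A minimal sketch of the configs that enable this (names from the Spark 3.1 configuration docs; verify defaults for your deployment):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.decommission.enabled", "true")                        # graceful decommissioning
    .config("spark.storage.decommission.enabled", "true")                # migrate blocks on shutdown
    .config("spark.storage.decommission.rddBlocks.enabled", "true")      # include cached RDD blocks
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")  # include shuffle blocks
    .getOrCreate())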
22. Shuffle Hash Join Improvement
Apache Spark 3.1
Spark prefers Sort Merge Join over Shuffle Hash Join to avoid OOM.
[Diagram: both the build side and the probe side are partitioned by the join keys (Partition 0, 1, 2, …); for each partition, the build side's rows are loaded into an in-memory hash table, and the probe side's rows look it up to join.]
23. Shuffle Hash Join Improvement
Apache Spark 3.1
Makes Shuffle Hash Join on par with Sort Merge Join and Broadcast Hash Join:
• Add code-gen for shuffled hash join (SPARK-32421)
• Support full outer join in shuffled hash join (SPARK-32399)
• Add handling for unique key in non-codegen hash join (SPARK-32420)
• Preserve shuffled hash join build side partitioning (SPARK-32330)
• ...
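A sketch of opting in via the join hint (available since Spark 3.0; the DataFrames are illustrative):

df1 = spark.range(100).withColumnRenamed("id", "key")
df2 = spark.range(50).withColumnRenamed("id", "key")
# With SPARK-32399, a shuffled hash join can now serve full outer joins too
joined = df1.join(df2.hint("SHUFFLE_HASH"), "key", "full_outer")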
24-25. Partition Pruning Improvement
Apache Spark 3.1
Partition pruning is critical for file scan performance.
[Diagram: file scan operator and catalog:]
1. The file scan operator sends partition predicates (pushed filters) to the catalog.
2. The catalog returns the pruned file list.
3. Spark launches tasks (task1, task2, task3, …) to read only the remaining files.
26. Partition Pruning Improvement
Apache Spark 3.1
Push down more partition predicates:
• Support Contains, StartsWith and EndsWith in partition pruning (SPARK-33458)
• Support date type in partition pruning (SPARK-33477)
• Support not-equals in partition pruning (SPARK-33582)
• Support NOT IN in partition pruning (SPARK-34538)
• ...
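For example (a sketch; 'events' is an illustrative table partitioned by a string column 'dt'):

# Predicates like these can now be pushed to the catalog for partition pruning
spark.sql("SELECT * FROM events WHERE dt LIKE '2021-01%'")  # StartsWith
spark.sql("SELECT * FROM events WHERE dt != '2021-06-01'")  # not-equals
spark.sql("SELECT * FROM events WHERE dt NOT IN ('2021-06-01', '2021-06-02')")  # NOT IN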
27. Predicate Pushdown Improvement
Apache Spark 3.1
Join conditions mixed with columns from both sides:
• Can push: FROM t1 JOIN t2 ON t1.key = 1 AND t2.key = 2
• Cannot push: FROM t1 JOIN t2 ON t1.key = 1 OR t2.key = 2
• Cannot push: FROM t1 JOIN t2 ON (t1.key = 1 AND t2.key = 2) OR t1.key = 3
Predicates mixing data columns and partition columns have similar issues.
28. Predicate Pushdown Improvement
Apache Spark 3.1
Conjunctive Normal Form (CNF):
(a1 AND a2) OR (b1 AND b2) ->
(a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2)
Each conjunct that references only one side can then be pushed down: more predicates pushed, less disk IO.
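Applied to the "cannot push" example from the previous slide (a sketch of the rewrite, not the literal optimizer output):

(t1.key = 1 AND t2.key = 2) OR t1.key = 3
-> (t1.key = 1 OR t1.key = 3) AND (t2.key = 2 OR t1.key = 3)

The first conjunct references only t1, so it can now be pushed below the join.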
29. Reduce Query Compiling Latency (3.2)
In Development
Optimize for short queries
A major improvement of the Catalyst framework
Stay tuned!
31. Structured Streaming
> 120 trillion records/day processed on Databricks with Structured Streaming
Growth on Databricks with Structured Streaming: 2.5x over the past year, 12x since two years ago
New: Stream-stream Join, History Server Support for SS, Streaming Table APIs, RocksDB State Store (3.2)
32. Streaming Table APIs
The APIs to read/write continuous data streams as unbounded tables
Apache Spark 3.1
input = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load())

(input.writeStream
    .option("checkpointLocation", "path/to/checkpoint/dir1")
    .format("delta")
    .toTable("myStreamTable"))
33. Streaming Table APIs
The APIs to read/write continuous data streams as unbounded tables
Apache Spark 3.1
(spark.readStream
    .table("myStreamTable")
    .select("value")
    .writeStream
    .option("checkpointLocation", "path/to/checkpoint/dir2")
    .format("delta")
    .toTable("newStreamTable"))

# Show the current snapshot
spark.read.table("myStreamTable").show(truncate=False)
36. State Store for Structured Streaming
State-store-backed operations. Examples:
• Stateful aggregations
• Drop duplicates
• Stream-stream joins
Maintains intermediate state data across batches, with:
• High availability
• Fault tolerance
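A minimal sketch of a state-store-backed query (the rate source is illustrative):

# Per-key counts live in the state store and are carried across micro-batches
counts = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
    .groupBy("value")
    .count())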
37. State Store for Structured Streaming
State store metrics are now shown in the live UI and the Spark History Server.
Apache Spark 3.1
38. State Store
HDFS-backed State Store (the default in Spark 3.1):
• Each executor has hash map(s) containing versioned data
• Updates are committed as delta files in HDFS
• Delta files are periodically collapsed into snapshots to improve recovery
Drawbacks:
• The amount of state that can be maintained is limited by the executors' heap size.
• State expiration by watermark and/or timeouts requires full scans over all the data.
[Diagram: a Spark cluster where each executor keeps in-memory hash maps of key-value pairs (k1 -> v1, k2 -> v2) and commits delta files to HDFS.]
39. RocksDB State Store
In Development
RocksDB state store (the default on Databricks):
• RocksDB can serve data from disk with a configurable amount of non-JVM memory.
• Sorting keys by the appropriate column should avoid full scans when finding the keys to drop.
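A sketch of switching providers (the class name as it later landed in Spark 3.2):

spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")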
42. Add type hints [PEP 484] to PySpark!
§ Simple to use
▪ Autocompletion in IDEs/notebooks. For example, in a Databricks notebook:
▪ Display Python docstring hints by pressing Shift+Tab
▪ Display a list of valid completions by pressing Tab
Apache Spark 3.1
43. Add type hints [PEP 484] to PySpark!
§ Simple to use
▪ Autocompletion in IDEs/notebooks
§ Richer API documentation
▪ Automatic inclusion of input and output types
§ Better code quality
▪ Static analysis in IDEs, error detection, type-mismatch checks, etc.
▪ Running mypy on CI systems
Apache Spark 3.1
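A sketch of what the hints enable (the function and column names are illustrative):

from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

def with_source(df: DataFrame, source: str) -> DataFrame:
    # With PySpark 3.1's inline type hints, mypy and IDEs can verify these annotations
    return df.withColumn("source", lit(source))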
45. Python Dependency Management
Supported environments:
• Conda
• Venv (Virtualenv)
• PEX
Apache Spark 3.1
[Diagram: the driver ships the archived (tar) environment to each worker.]
Blog post: How to Manage Python Dependencies in PySpark — http://tinyurl.com/pysparkdep
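A sketch following the approach in the blog post (archive name and paths are illustrative; assumes a conda environment packed beforehand, e.g. with conda-pack):

import os
from pyspark.sql import SparkSession

os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"  # Python inside the unpacked archive
spark = (SparkSession.builder
    .config("spark.archives", "pyspark_env.tar.gz#environment")  # shipped and unpacked on workers
    .getOrCreate())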
46. Koalas
Announced April 24, 2019. A pure Python library, familiar if you are coming from pandas.
§ Aims at providing the pandas API on top of Spark
§ Unifies the two ecosystems with a familiar API
§ Seamless transition between small and large data
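A minimal sketch (pip install koalas; the data is illustrative):

import databricks.koalas as ks

kdf = ks.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
print(kdf.describe())  # pandas-style API, executed on Spark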
47. Koalas Growth (from 0 to 84 Countries)
> 2 million imports per month on Databricks
~ 3 million PyPI downloads per month
55. New Utility Functions for Unix Time
Apache Spark 3.1
timestamp_seconds: Creates a timestamp from the number of seconds (can be fractional) since the UTC epoch.
> SELECT timestamp_seconds(1230219000);
2008-12-25 07:30:00
unix_seconds: Returns the number of seconds since 1970-01-01 00:00:00 UTC (the UTC epoch).
> SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));
1
date_from_unix_date: Creates a date from the number of days since 1970-01-01.
> SELECT date_from_unix_date(1);
1970-01-02
unix_date: Returns the number of days since 1970-01-01.
> SELECT unix_date(DATE('1970-01-02'));
1
Also: timestamp_millis, timestamp_micros, unix_millis and unix_micros.
56. Unix Time Functions Compared (Name / Version / Description / Example)
Name: unix_seconds, unix_micros, unix_millis — Version: 3.1
Description: Returns the number of seconds/microseconds/milliseconds since 1970-01-01 00:00:00 UTC.
-- input: Timestamp => output: Long
> SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));
1

Name: to_unix_timestamp — Version: 1.6; unix_timestamp — Version: 1.5 (not recommended unless you want the current timestamp)
Description: Returns the number of seconds since 1970-01-01 00:00:00 UTC.
-- input: String => output: Long
> SELECT to_unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460098800
> SELECT unix_timestamp(); -- returns the current timestamp
1460041200
> SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460041200

Name: timestamp_seconds — Version: 3.1
Description: Creates a timestamp from the number of seconds (can be fractional) since 1970-01-01 00:00:00 UTC.
-- input: Long => output: Timestamp
> SELECT timestamp_seconds(1230219000);
2008-12-25 07:30:00

Name: from_unixtime — Version: 1.5
Description: Returns a timestamp string in the specified format, converted from the input number of seconds elapsed since 1970-01-01 00:00:00 UTC.
-- input: Long => output: String
> SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
1969-12-31 16:00:00
57. New Utility Functions for Time Zone
current_timezone: Returns the current session local timezone.
> SELECT current_timezone();
Asia/Shanghai
SET TIME ZONE: Sets the local time zone of the current session.
-- Set time zone to the system default.
SET TIME ZONE LOCAL;
-- Set time zone to a region-based zone ID.
SET TIME ZONE 'America/Los_Angeles';
-- Set time zone to a zone offset.
SET TIME ZONE '+08:00';
Apache Spark 3.1
Upcoming new data type: Timestamp Without Time Zone
58. EXPLAIN FORMATTED
§ The Spark UI uses EXPLAIN FORMATTED by default
§ New format for Adaptive Query Execution
Apache Spark 3.1
[Screenshot callouts: the final plan after adaptive optimization; the plan after cost-based optimization]
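A quick sketch (the query is illustrative):

# mode="formatted" prints a condensed plan outline followed by per-node details
spark.sql("SELECT id, count(*) FROM range(10) GROUP BY id").explain(mode="formatted")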
66. Installation Option for PyPI
Apache Spark 3.1
The default bundled Hadoop version is changed to Hadoop 3.2.
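A sketch of the new installation options (PYSPARK_HADOOP_VERSION is documented in the Spark 3.1 installation guide):

pip install pyspark                              # bundled Hadoop 3.2 (the new default)
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark   # pick a different bundled Hadoop version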
67. Deprecations and Removals
Apache Spark 3.1
§ Drop Python 2.7, 3.4 and 3.5 (SPARK-32138)
§ Drop R < 3.5 support (SPARK-32073)
§ Remove the hive-1.2 distribution (SPARK-32981)
In the upcoming Spark 3.2:
§ Deprecate Mesos support (SPARK-35050)
68. Apache Spark 3.1
[Word cloud recap of the Spark 3.1 features from the overview slide: ANSI compliance, performance, execution, streaming, Python, and more.]
69. In Development
[Word cloud of upcoming features, grouped by theme:]
• ANSI SQL Compliance: ANSI Mode GA, Error Code, Implicit Type Cast, Interval Type, Timestamp w/o Time Zone, Lateral Join, Decorrelation Framework
• Performance: Adaptive Optimization, Compile Latency Reduction, Push-based Shuffle, Low-latency Scheduler, Parquet 1.12 (Column Index)
• Streaming: RocksDB State Store, Queryable State Store, Session Window, State Store APIs
• Python: Pythonic Error Handling, Richer Input/Output, pandas APIs, Visualization and Plotting
• More: Scala 2.13 Beta, Java 17, Complex Type Support in ORC, DML Metrics