Deep Dive into the New Features of
Apache Spark 3.1
Data + AI Summit 2021
Xiao Li gatorsmile
Wenchen Fan cloud-fan
5000+ CUSTOMERS across the globe
Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads
The Data and AI Company, ORIGINAL CREATORS of Apache Spark
About Us
• The Spark team at Databricks
• Apache Spark Committer and PMC members
Xiao Li (Github: gatorsmile)
Wenchen Fan (Github: cloud-fan)
Apache Spark 3.1 at a glance, across Execution, ANSI Compliance, Python, Performance, Streaming, and more:
Node Decommissioning, Create Table Syntax, Sub-expression Elimination, Spark on K8S GA, Runtime Error, Partition Pruning, Explicit Cast, Char/Varchar, Predicate Pushdown, Shuffle Hash Join, Shuffle Removal, Nested Field Pruning, Catalog APIs for JDBC, Ignore Hints, Stream-stream Join, New Doc for PySpark, History Server Support for SS, Streaming Table APIs, Python Type Support, Dependency Management, Installation Option for PyPi, Search Function in Spark Doc, State Schema Validation, Stage-level Scheduling
ANSI SQL Compliance: Overflow Checking, Create Table Syntax, Explicit Cast, CHAR/VARCHAR
Fail Earlier for Invalid Data
Spark 3.0 added overflow checking; Spark 3.1 adds more checks for invalid input.
Apache Spark 3.1
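A minimal PySpark sketch of the behavior (spark.sql.ansi.enabled is the real switch; the query itself is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

try:
    # With ANSI mode on, integer overflow raises at runtime
    # instead of silently wrapping around.
    spark.sql("SELECT 2147483647 + 1").show()
except Exception as e:
    print(e)  # ArithmeticException: integer overflow

# Casting malformed input also fails instead of returning NULL:
# spark.sql("SELECT CAST('abc' AS INT)").show()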
Forbid Confusing CAST
Apache Spark 3.1
ANSI Mode GA in Spark 3.2
• ANSI implicit CAST
• More runtime failures for invalid input
• No-failure alternatives: TRY_CAST, TRY_DIVIDE, TRY_ADD, …
• ...
In Development
Unified CREATE TABLE SQL Syntax
Apache Spark 3.1
CREATE TABLE t (col INT) USING parquet OPTIONS …
CREATE TABLE t (col INT) STORED AS parquet SERDEPROPERTIES …
// ☹ This creates a Hive text table, super slow!
CREATE TABLE t (col INT)
// After Spark 3.1 …
SET spark.sql.legacy.createHiveTableByDefault=false
// 🙂 Now this creates a native Parquet table
CREATE TABLE t (col INT)
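A hedged PySpark sketch of the new default path (the table name t2 is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SET spark.sql.legacy.createHiveTableByDefault=false")

# A bare CREATE TABLE now creates a native datasource table;
# DESCRIBE shows the provider (Provider: parquet).
spark.sql("CREATE TABLE t2 (col INT)")
spark.sql("DESCRIBE TABLE EXTENDED t2").show(truncate=False)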
CHAR/VARCHAR Support
Apache Spark 3.1
Table insertion fails at runtime if the input string length exceeds the CHAR/VARCHAR
length
CHAR/VARCHAR Support
Apache Spark 3.1
CHAR type values are padded to the declared length.
This is mostly for ANSI compatibility; using VARCHAR is recommended.
A sketch of both behaviors follows.
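A minimal PySpark sketch (the table name chars is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE TABLE chars (c CHAR(5), v VARCHAR(5)) USING parquet")
spark.sql("INSERT INTO chars VALUES ('abc', 'abc')")  # fits: OK
spark.read.table("chars").show()  # the CHAR value is read back padded: 'abc  '

try:
    # Exceeds the declared length 5: fails at runtime in Spark 3.1.
    spark.sql("INSERT INTO chars VALUES ('toolong', 'x')")
except Exception as e:
    print(e)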
More ANSI Features
Apache Spark 3.1
• Unify SQL temp view and permanent view behaviors (SPARK-33138)
• Re-parse and analyze the view SQL string when reading the view
• Support column list in INSERT statement (SPARK-32976)
• INSERT INTO t(col2, col1) VALUES …
• Support ANSI nested bracketed comments (SPARK-28880)
• ...
More ANSI Features Coming in Spark 3.2!
• ANSI year-month and day-time INTERVAL data types
• Comparable and persistable
• ANSI TIMESTAMP WITHOUT TIMEZONE type
• Simplify the timestamp handling
• A new decorrelation framework for correlated subquery
• Support outer reference in more places
• LATERAL JOIN
• FROM t1 JOIN LATERAL (SELECT t1.col + t2.col FROM t2)
• SQL error code
• More searchable, cross-language, JDBC compatible
In Development
Node Decommissioning
Apache Spark 3.1
Gracefully handle scheduled executor shutdown
• Auto-scaling: Spark decides to shut down one or more idle executors.
• EC2 spot instances: the executor gets notified when it's going to be killed soon.
• GCE preemptible instances: same as above.
• YARN & Kubernetes: kill containers with notification for higher priority tasks.
Node Decommissioning
Apache Spark 3.1
Migrate RDD cache and shuffle blocks from executors going to be shut down to
other live executors.
Diagram: (1) the shutdown trigger sends a signal to notify executor 1 of the shutdown; (2) executor 1 notifies the driver and migrates its data to executors 2 and 3; (3) the driver stops scheduling tasks on executor 1.
Summary
Apache Spark 3.1
• Migrate data blocks to other nodes before shutdown, to avoid recomputing
later.
• Stop scheduling tasks on the decommissioning node, as they likely can't
complete and would waste resources.
• Launch speculative copies of tasks running on the decommissioning node that
likely can't complete. (A config sketch follows.)
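A sketch of the configs behind this feature, as named in the Spark 3.1 configuration docs (set them before the session starts):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # React to the decommission signal instead of dying abruptly:
    .config("spark.decommission.enabled", "true")
    # Enable block migration, for RDD cache and shuffle blocks respectively:
    .config("spark.storage.decommission.enabled", "true")
    .config("spark.storage.decommission.rddBlocks.enabled", "true")
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .getOrCreate())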
SQL Performance: Shuffle Hash Join, Partition Pruning, Predicate Pushdown, Compile Latency Reduction
Shuffle Hash Join Improvement
Spark prefers Sort Merge Join over Shuffle Hash Join to avoid OOM.
Apache Spark 3.1
Diagram: both sides are partitioned by the join keys; for each pair of partitions, the build side is loaded into an in-memory hash table, and probe-side rows look it up to join.
Shuffle Hash Join Improvement
Apache Spark 3.1
Makes Shuffle Hash Join on par with Sort Merge Join and Broadcast Hash Join (see the sketch after this list)
• Add code-gen for shuffled hash join (SPARK-32421)
• Support full outer join in shuffled hash join (SPARK-32399)
• Add handling for unique key in non-codegen hash join (SPARK-32420)
• Preserve shuffled hash join build side partitioning (SPARK-32330)
• ...
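A minimal sketch of two ways to get a shuffled hash join from PySpark (the SHUFFLE_HASH hint has existed since 3.0; the 3.1 work above makes it far more usable):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
big = spark.range(0, 1000000).withColumnRenamed("id", "key")
small = spark.range(0, 1000).withColumnRenamed("id", "key")

# Per-join hint on the intended build side:
big.join(small.hint("SHUFFLE_HASH"), "key").explain()

# Or express a global preference away from sort merge join:
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")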
Partition Pruning Improvement
Partition pruning is critical for file scan performance
Apache Spark 3.1
Diagram: (1) the file scan operator sends partition predicates to the catalog; (2) the catalog returns the pruned file list, while non-partition filters stay as pushed filters in the scan; (3) Spark launches tasks to read the remaining files.
Partition Pruning Improvement
Apache Spark 3.1
Push down more partition predicates (a usage sketch follows this list)
• Support Contains, StartsWith and EndsWith in partition pruning (SPARK-33458)
• Support date type in partition pruning (SPARK-33477)
• Support not-equals in partition pruning (SPARK-33582)
• Support NOT IN in partition pruning (SPARK-34538)
• ...
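A hedged PySpark sketch (the table events and its string partition column dt are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("events")  # hypothetical table partitioned by `dt`

# In 3.1 this startswith predicate can be sent to the metastore, so only
# matching partitions are listed and scanned (SPARK-33458):
df.filter(df.dt.startswith("2021-05")).count()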
Predicate Pushdown Improvement
Apache Spark 3.1
Join condition mixed with columns from both sides
• Can push: FROM t1 JOIN t2 ON t1.key = 1 AND t2.key = 2
• Cannot push: FROM t1 JOIN t2 ON t1.key = 1 OR t2.key = 2
• Cannot push: FROM t1 JOIN t2 ON
(t1.key = 1 AND t2.key = 2) OR t1.key = 3
Predicates mixed with data columns and partition columns have similar issues.
Predicate Pushdown Improvement
Apache Spark 3.1
Conjunctive Normal Form (CNF)
(a1 AND a2) OR (b1 AND b2) ->
(a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2)
Pushing down more predicates means less disk I/O. For example, CNF conversion turns
(t1.key = 1 AND t2.key = 2) OR t1.key = 3 into
(t1.key = 1 OR t1.key = 3) AND (t2.key = 2 OR t1.key = 3),
and the t1-only conjunct can now be pushed to t1's scan.
Reduce Query Compiling Latency (3.2)
Apache Spark 3.1
Optimize for short queries
A major improvement of the catalyst framework
Stay tuned!
Structured Streaming
> 120 trillion records/day processed on Databricks with Structured Streaming
2.5x growth in the past year, 12x since 2 years ago, on Databricks with Structured Streaming
In this section: Stream-stream Join, History Server Support for SS, Streaming Table APIs, RocksDB State Store (3.2)
Streaming Table APIs
The APIs to read/write continuous data streams as unbounded tables
Apache Spark 3.1
input = spark.readStream
.format("rate")
.option("rowsPerSecond", 10)
.load()
input.writeStream
.option("checkpointLocation",
"path/to/checkpoint/dir1")
.format("delta")
.toTable("myStreamTable")
Streaming Table APIs
The APIs to read/write continuous data streams as unbounded tables
Apache Spark 3.1
spark.readStream
.table("myStreamTable")
.select("value")
.writeStream
.option("checkpointLocation",
"path/to/checkpoint/dir2")
.format("delta")
.toTable("newStreamTable")
// Show the current snapshot
spark.read
.table("myStreamTable")
.show(false)
Stream-stream Join
Apache Spark 3.1
New in 3.1: full outer join and left semi join are now supported in stream-stream joins.
Monitoring Structured Streaming
Structured Streaming support in the Spark History Server was added in the 3.1 release.
Apache Spark 3.1
State Store for Structured Streaming
State Store backed operations. Examples:
• Stateful aggregations
• Drop duplicates
• Stream-stream joins
Maintaining intermediate state data across batches, which requires:
• High availability
• Fault tolerance
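A minimal sketch of a state-store-backed query, using the built-in rate source (the checkpoint path is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Streaming dedup keeps the set of seen keys in the state store; the
# watermark bounds how long that state is retained.
deduped = (events
    .withWatermark("timestamp", "10 minutes")
    .dropDuplicates(["value", "timestamp"]))

query = (deduped.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/dedup-checkpoint")
    .start())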
State Store for Structured Streaming
The metrics of the state store are added to Live UI and Spark History Server
Apache Spark 3.1
State Store
HDFS-backed State Store (the default in Spark 3.1):
• Each executor has hashmap(s) containing versioned data
• Updates committed as delta files in HDFS
• Delta files periodically collapsed into snapshots to improve recovery
Drawbacks:
• The amount of state that can be maintained is limited by the executors' heap size.
• State expiration by watermark and/or timeouts requires full scans over all the data.
Diagram: each executor keeps versioned hash maps of key-value state (k1→v1, k2→v2); updates are committed as delta files in HDFS.
RocksDB State Store
In Development
RocksDB state store (default in Databricks)
• RocksDB can serve data from the disk with a configurable amount of non-JVM memory.
• Sorting keys by the appropriate column avoids full scans when finding the keys to drop. (A config sketch follows.)
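A hedged sketch of opting in once it ships (provider class name per the Spark 3.2 docs):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.streaming.stateStore.providerClass",
            "org.apache.spark.sql.execution.streaming.state."
            "RocksDBStateStoreProvider")
    .getOrCreate())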
Project Zen: Python Usability
PySpark
Language usage in 2020: Python 47%, SQL 41%, Scala and others 12%
In this section: Python Type Support, Dependency Management, Visualization (3.2), pandas APIs (3.2)
Add the type hints [PEP 484] to PySpark!
§ Simple to use
▪ Autocompletion in IDE/Notebook. For example, in Databricks notebook,
▪ Display Python docstring hints by pressing Shift+Tab
▪ Display a list of valid completions by pressing Tab
Apache Spark 3.1
Add the type hints [PEP 484] to PySpark!
§ Simple to use
▪ Autocompletion in IDE/Notebook
§ Richer API documentation
▪ Automatic inclusion of input and output types
§ Better code quality
▪ Static analysis in IDE, error detection, type mismatch, etc.
▪ Running mypy on CI systems.
Apache Spark 3.1
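A minimal sketch of what the shipped type hints enable in user code:

from pyspark.sql import SparkSession, DataFrame

# mypy and IDEs can now treat PySpark like ordinary typed Python:
def first_ids(spark: SparkSession, n: int) -> DataFrame:
    return spark.range(n).limit(10)

# e.g. first_ids(spark, "10") is flagged as a type error before runtime.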
Static Error Detection
Apache Spark 3.1
Python Dependency Management
Supported environments
• Conda
• Venv (Virtualenv)
• PEX
Apache Spark 3.1
Diagram: the driver ships the packed environment archive (tar) to each worker.
Blog post: How to Manage Python Dependencies in
PySpark http://tinyurl.com/pysparkdep
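A hedged sketch, following the linked blog post, of shipping a packed Conda environment to executors; the archive (built beforehand, e.g. with conda pack) and its name are illustrative, and spark.archives is new in 3.1:

import os
from pyspark.sql import SparkSession

os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
spark = (SparkSession.builder
    .master("yarn")
    .config("spark.archives", "pyspark_env.tar.gz#environment")
    .getOrCreate())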
Koalas
Announced April 24, 2019
Pure Python library
Familiar if coming from pandas
§ Aims at providing the pandas API on top of Spark
§ Unifies the two ecosystems with a familiar API
§ Seamless transition between small and large data
Koalas Growth (from 0 to 84 Countries)
> 2 million imports per month on Databricks
~ 3 million PyPI downloads per month
pandas APIs
In Development
pandas APIs
In Development
import pandas as pd
df = pd.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)
import databricks.koalas as ks
df = ks.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)
pandas APIs
In Development
import pyspark.pandas as ps
df = ps.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)
import pandas as pd
df = pd.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)
import databricks.koalas as ks
df = ks.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)
Visualization and Plotting
In Development
import pyspark.pandas as ps
df = ps.DataFrame({
'a': [1, 2, 2.5, 3, 3.5, 4, 5],
'b': [1, 2, 3, 4, 5, 6, 7],
'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
df.plot.hist()
Visualization and Plotting
In Development
Usability Enhancements: Error Messages, Date-Time Utility Functions, Ignore Hints, EXPLAIN
New Utility Functions for Unix Time
timestamp_seconds: Creates timestamp from the
number of seconds (can be fractional) since UTC epoch.
> SELECT timestamp_seconds(1230219000);
2008-12-25 07:30:00
unix_seconds: Returns the number of seconds since
1970-01-01 00:00:00 UTC (UTC epoch).
> SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));
1
Apache Spark 3.1
Also, timestamp_millis, timestamp_micros, unix_millis and unix_micros
date_from_unix_date: Create date from the
number of days since 1970-01-01.
> SELECT date_from_unix_date(1);
1970-01-02
unix_date: Returns the number of days since
1970-01-01.
> SELECT unix_date(DATE('1970-01-02'));
1
Name (since) / Description / Example:

unix_seconds, unix_micros, unix_millis (3.1)
Returns the number of seconds/microseconds/milliseconds since 1970-01-01 00:00:00 UTC.
-- input: Timestamp => output: Long
> SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));
1

to_unix_timestamp (1.6), unix_timestamp (1.5; not recommended unless you want the current timestamp)
Returns the number of seconds since 1970-01-01 00:00:00 UTC.
-- input: String => output: Long
> SELECT to_unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460098800
> SELECT unix_timestamp(); -- returns the current timestamp
1460041200
> SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd');
1460041200

timestamp_seconds (3.1)
Creates a timestamp from the number of seconds (can be fractional) since 1970-01-01 00:00:00 UTC.
-- input: Long => output: Timestamp
> SELECT timestamp_seconds(1230219000);
2008-12-25 07:30:00

from_unixtime (1.5)
Returns a timestamp string in the specified format, converted from the input number of seconds elapsed since 1970-01-01 00:00:00 UTC.
-- input: Long => output: String
> SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
1969-12-31 16:00:00
New Utility Functions for Time Zone
current_timezone: Returns the current session local timezone.
> SELECT current_timezone();
Asia/Shanghai
SET TIME ZONE: Sets the local time zone of the current session.
-- Set time zone to the system default.
SET TIME ZONE LOCAL;
-- Set time zone to the region-based zone ID.
SET TIME ZONE 'America/Los_Angeles';
-- Set time zone to the Zone offset.
SET TIME ZONE '+08:00';
Apache Spark 3.1
Upcoming new data type: Timestamp Without Time Zone
EXPLAIN FORMATTED
§ Spark UI uses EXPLAIN FORMATTED by default
§ New Format for Adaptive Query Execution
Apache Spark 3.1
The final plan after adaptive optimization
The plan after cost-based optimization
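A minimal sketch of requesting the formatted plan yourself:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).groupBy((col("id") % 3).alias("k")).count()

# DataFrame API:
df.explain(mode="formatted")

# SQL equivalent:
spark.sql(
    "EXPLAIN FORMATTED SELECT id % 3 AS k, count(*) FROM range(100) GROUP BY k"
).show(truncate=False)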
Ignore Hints
Apache Spark 3.1
• Is the hinted plan actually faster?
• Does it cause OOM?
Ignore Hints
Apache Spark 3.1
SET spark.sql.optimizer.disableHints=true
Standardize Error Messages
§ Group exception messages in dedicated files for easy maintenance and
auditing. [SPARK-33539]
§ Improve error messages quality
▪ Establish an error messages guideline for developers
https://spark.apache.org/error-message-guidelines.html
§ Add language-agnostic, locale-agnostic identifiers/SQLSTATE
SPK-10000 MISSING_COLUMN_ERROR: cannot resolve 'bad_column' given input columns:
[a, b, c, d, e]; (SQLSTATE 42704)
In Development
Documentation and Environments: New Doc for PySpark, Installation Option for PyPi, Search Function in Spark Doc, JAVA 17 (3.2)
New Doc for PySpark
Apache Spark 3.1
https://spark.apache.org/docs/latest/api/python/index.html
Search Function in Spark Doc
Apache Spark 3.1
Installation Option for PyPI
Apache Spark 3.1
The default Hadoop version is changed to Hadoop 3.2.
Deprecations and Removals
§ Drop Python 2.7, 3.4 and 3.5 (SPARK-32138)
§ Drop R < 3.5 support (SPARK-32073)
§ Remove hive-1.2 distribution (SPARK-32981)
In the upcoming Spark 3.2
§ Deprecate support of Mesos (SPARK-35050)
Apache Spark 3.1
Coming in Apache Spark 3.2 (In Development), across ANSI SQL Compliance, Python, Performance, Streaming, and more:
Decorrelation Framework, Timestamp w/o Time Zone, Adaptive Optimization, Scala 2.13 Beta, Error Code, Implicit Type Cast, Interval Type, Complex Type Support in ORC, Lateral Join, Compile Latency Reduction, Low-latency Scheduler, JAVA 17, Push-based Shuffle, Session Window, Visualization and Plotting, RocksDB State Store, Queryable State Store, Pythonic Error Handling, Richer Input/Output, pandas APIs, Parquet 1.12 (Column Index), State Store APIs, DML Metrics, ANSI Mode GA
Thank you for your
contributions!
Xiao Li (lixiao @ databricks)
Wenchen Fan (wenchen @ databricks)

Contenu connexe

Tendances

Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDatabricks
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsDatabricks
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0Kazuaki Ishizaki
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
 
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...VMware Tanzu
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookDatabricks
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
 

Tendances (20)

Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
SQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at ComcastSQL Analytics Powering Telemetry Analysis at Comcast
SQL Analytics Powering Telemetry Analysis at Comcast
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
 
Apache spark
Apache sparkApache spark
Apache spark
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
 

Similaire à Deep Dive into the New Features of Apache Spark 3.1

What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0Databricks
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼NAVER D2
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learningfelixcss
 
SSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix CheungSSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix CheungDatabricks
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
 
Sink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleSink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleTimothy Spann
 
Sink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleSink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleScyllaDB
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookDatabricks
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexApache Apex
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 

Similaire à Deep Dive into the New Features of Apache Spark 3.1 (20)

What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
SSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix CheungSSR: Structured Streaming on R for Machine Learning with Felix Cheung
SSR: Structured Streaming on R for Machine Learning with Felix Cheung
 
Meetup talk
Meetup talkMeetup talk
Meetup talk
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
Sink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleSink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any Scale
 
Sink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any ScaleSink Your Teeth into Streaming at Any Scale
Sink Your Teeth into Streaming at Any Scale
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
 
Spark cep
Spark cepSpark cep
Spark cep
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...HyderabadDolls
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 

Dernier (20)

Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 

Deep Dive into the New Features of Apache Spark 3.1

  • 1. Deep Dive into the New Features of Apache Spark 3.1 Data + AI Summit 2021 Xiao Li gatorsmile Wenchen Fan cloud-fan
  • 2. 5000+ Across the globe CUSTOMERS Lakehouse One simple platform to unify all of your data, analytics, and AI workloads The Data and AI Company ORIGINAL CREATORS
  • 3. • The Spark team at • Apache Spark Committer and PMC members About Us Xiao Li (Github: gatorsmile) Wenchen Fan (Github: cloud-fan)
  • 4. Execution ANSI Compliance Python Performance More Streaming Node Decommissioning Create Table Syntax Sub-expression Elimination Spark on K8S GA Runtime Error Partition Pruning Explicit Cast Char/Varchar Predicate Pushdown Shuffle Hash Join Shuffle Removal Nested Field Pruning Catalog APIs for JDBC Ignore Hints Stream- stream Join New Doc for PySpark History Server Support for SS Streaming Table APIs Python Type Support Dependency Management Installation Option for PyPi Search Function in Spark Doc State Schema Validation Stage-level Scheduling Apache Spark 3.1
  • 6. Fail Earlier for Invalid Data In Spark 3.0 we added the overflow check, in 3.1 more checks are added Apache Spark 3.1
  • 9. ANSI Mode GA in Spark 3.2 • ANSI implicit CAST • More runtime failures for invalid input • No-failure alternatives: TRY_CAST, TRY_DIVIDE, TRY_ADD, … • ... In Development
  • 10. Unified CREATE TABLE SQL Syntax Apache Spark 3.1 CREATE TABLE t (col INT) USING parquet OPTIONS … CREATE TABLE t (col INT) STORED AS parquet SERDEPROPERTIES … // L This creates a Hive text table, super slow! CREATE TABLE t (col INT) // After Spark 3.1 … SET spark.sql.legacy.createHiveTableByDefault=false // J Now this creates native parquet table CREATE TABLE t (col INT)
  • 11. CHAR/VARCHAR Support Apache Spark 3.1 Table insertion fails at runtime if the input string length exceeds the CHAR/VARCHAR length
  • 12. CHAR/VARCHAR Support Apache Spark 3.1 CHAR type values will be padded to the length. Mostly for ANSI compatibility and recommend to use VARCHAR.
  • 13. More ANSI Features Apache Spark 3.1 • Unify SQL temp view and permanent view behaviors (SPARK-33138) • Re-parse and analyze the view SQL string when reading the view • Support column list in INSERT statement (SPARK-32976) • INSERT INTO t(col2, col1) VALUES … • Support ANSI nested bracketed comments (SPARK-28880) • ...
  • 14. More ANSI Features Coming in Spark 3.2! • ANSI year-month and day-time INTERVAL date types • Comparable and persist-able • ANSI TIMESTAMP WITHOUT TIMEZONE type • Simplify the timestamp handling • A new decorrelation framework for correlated subquery • Support outer reference in more places • LATERAL JOIN • FROM t1 JOIN LATERAL (SELECT t1.col + t2.col FROM t2) • SQL error code • More searchable, cross language, JDCB compatible In Development
  • 16. Node Decommissioning (Apache Spark 3.1) — Gracefully handle scheduled executor shutdown:
    • Auto-scaling: Spark decides to shut down one or more idle executors.
    • EC2 spot instances: the executor gets notified when it’s going to be killed soon.
    • GCE preemptible instances: same as above.
    • YARN & Kubernetes: kill containers, with notification, to make room for higher-priority tasks.
  • 17.–19. Node Decommissioning (Apache Spark 3.1) — Migrate RDD cache and shuffle blocks from executors that are going to be shut down to other live executors. [Diagram, built up over three slides]:
    1. The shutdown trigger sends a signal to notify executor 1 of the upcoming shutdown.
    2. Executor 1 notifies the driver about this and migrates its data to executors 2 and 3.
    3. The driver stops scheduling tasks to executor 1.
  • 20. Summary Apache Spark 3.1 • Migrate data blocks to other nodes before shutdown, to avoid recomputing later. • Stop scheduling tasks on the decommissioning node as they likely can’t complete and waste resources. • Launch speculative tasks for tasks running on the decommissioning node that likely can’t complete.
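    A configuration sketch, under the assumption that your cluster manager delivers decommission signals; the conf names below follow the 3.1 configuration docs:

      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
          # Opt in to graceful decommissioning
          .config("spark.decommission.enabled", "true")
          # Migrate block data away from the decommissioning executor
          .config("spark.storage.decommission.enabled", "true")
          .config("spark.storage.decommission.rddBlocks.enabled", "true")
          .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
          .getOrCreate())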
  • 22. Shuffle Hash Join Improvement (Apache Spark 3.1) — Spark prefers Sort Merge Join over Shuffle Hash Join to avoid OOM. [Diagram: build side and probe side are each partitioned by the join keys; each build-side partition is loaded into an in-memory hash table, and probe-side rows look up the hash table to join.]
  • 23. Shuffle Hash Join Improvement Apache Spark 3.1 Makes Shuffle Hash Join on-par with Sort Merge Join and Broadcast Hash Join • Add code-gen for shuffled hash join (SPARK-32421) • Support full outer join in shuffled hash join (SPARK-32399) • Add handling for unique key in non-codegen hash join (SPARK-32420) • Preserve shuffled hash join build side partitioning (SPARK-32330) • ...
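    As an illustration, the SHUFFLE_HASH join hint (available since 3.0) now benefits from these improvements; `df1` and `df2` are hypothetical DataFrames sharing a `key` column:

      # Ask Spark to build a hash table on df2 instead of sort-merging
      joined = df1.join(df2.hint("SHUFFLE_HASH"), "key", "full_outer")  # full outer is new in 3.1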
  • 24.–25. Partition Pruning Improvement (Apache Spark 3.1) — Partition pruning is critical for file scan performance. [Diagram, repeated across two slides]:
    1. The file scan operator sends partition predicates (pushed filters) to the catalog.
    2. The catalog returns the pruned file list.
    3. Spark launches tasks to read the remaining files.
  • 26. Partition Pruning Improvement Apache Spark 3.1 Pushdown more partition predicates • Support Contains, StartsWith and EndsWith in partition pruning (SPARK-33458) • Support date type in partition pruning (SPARK-33477) • Support not-equals in partition pruning (SPARK-33582) • Support NOT IN in partition pruning (SPARK-34538) • ...
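    For illustration, filters like the following on a hypothetical partition column `dt` can now be pushed to the catalog for pruning:

      # StartsWith (from a LIKE prefix), not-equals, and NOT IN are now pushed down
      spark.table("sales").where("dt LIKE '2021-01%'")
      spark.table("sales").where("dt != '2021-01-01'")
      spark.table("sales").where("dt NOT IN ('2021-01-01', '2021-01-02')")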
  • 27. Predicate Pushdown Improvement Apache Spark 3.1 Join condition mixed with columns from both sides • Can push: FROM t1 JOIN t2 ON t1.key = 1 AND t2.key = 2 • Cannot push: FROM t1 JOIN t2 ON t1.key = 1 OR t2.key = 2 • Cannot push: FROM t1 JOIN t2 ON (t1.key = 1 AND t2.key = 2) OR t1.key = 3 Predicates mixed with data columns and partition columns have similar issues.
  • 28. Predicate Pushdown Improvement Apache Spark 3.1 Conjunctive Normal Form (CNF) (a1 AND a2) OR (b1 AND b2) -> (a1 OR b1) AND (a1 OR b2) AND (a2 OR b1) AND (a2 OR b2) Push down more predicates, less disk IO.
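    A worked sketch of why CNF helps, reusing the third example from the previous slide:

      # Original join condition -- no single-side predicate can be extracted as-is:
      #   (t1.key = 1 AND t2.key = 2) OR t1.key = 3
      # CNF form, distributing OR over AND:
      #   (t1.key = 1 OR t1.key = 3) AND (t2.key = 2 OR t1.key = 3)
      # The first conjunct references only t1, so it can be pushed below the join
      # into t1's scan, skipping data before the join runs.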
  • 29. Reduce Query Compiling Latency (3.2) — Optimize for short queries; a major improvement of the Catalyst framework. Stay tuned!
  • 31. Structured Streaming
    > 120 trillion records/day processed on Databricks with Structured Streaming
    Growth on Databricks with Structured Streaming: 2.5x over the past year, 12x since 2 years ago
    In this release: Stream-stream Join, History Server Support for SS, Streaming Table APIs; RocksDB State Store (3.2)
  • 32. Streaming Table APIs (Apache Spark 3.1) — The APIs to read/write continuous data streams as unbounded tables:
    input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()

    input.writeStream
      .option("checkpointLocation", "path/to/checkpoint/dir1")
      .format("delta")
      .toTable("myStreamTable")
  • 33. Streaming Table APIs (Apache Spark 3.1) — The APIs to read/write continuous data streams as unbounded tables:
    spark.readStream
      .table("myStreamTable")
      .select("value")
      .writeStream
      .option("checkpointLocation", "path/to/checkpoint/dir2")
      .format("delta")
      .toTable("newStreamTable")

    // Show the current snapshot
    spark.read
      .table("myStreamTable")
      .show(false)
  • 34. Stream-stream Join (Apache Spark 3.1) — Full outer join and left semi join are newly supported in 3.1. [The slide shows the supported join-type matrix with the new entries highlighted.]
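    A hedged sketch of a full outer stream-stream join, with hypothetical rate sources and renamed columns; both sides need watermarks plus a time-range condition in the join:

      from pyspark.sql.functions import expr

      left = (spark.readStream.format("rate").load()
              .withColumnRenamed("value", "lv").withColumnRenamed("timestamp", "lt"))
      right = (spark.readStream.format("rate").load()
               .withColumnRenamed("value", "rv").withColumnRenamed("timestamp", "rt"))

      joined = (left.withWatermark("lt", "10 seconds")
          .join(right.withWatermark("rt", "10 seconds"),
                expr("lv = rv AND rt BETWEEN lt AND lt + interval 5 seconds"),
                "full_outer"))   # full outer stream-stream join is new in 3.1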
  • 35. Monitoring Structured Streaming (Apache Spark 3.1) — Structured Streaming support in the Spark History Server is added in the 3.1 release.
  • 36. State Store for Structured Streaming — State Store backed operations, for example:
    • Stateful aggregations
    • Drop duplicates
    • Stream-stream joins
    Maintains intermediate state data across batches, with:
    • High availability
    • Fault tolerance
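    For illustration, a stateful deduplication whose per-key state lives in the State Store; `events` is a hypothetical streaming DataFrame with `id` and `eventTime` columns:

      deduped = (events
          .withWatermark("eventTime", "10 minutes")    # bounds how long state is retained
          .dropDuplicates(["id", "eventTime"]))        # the State Store remembers keys seen so far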
  • 37. State Store for Structured Streaming The metrics of the state store are added to Live UI and Spark History Server Apache Spark 3.1
  • 38. State Store — HDFS-backed State Store (the default in Spark 3.1):
    • Each executor has hashmap(s) containing versioned data
    • Updates are committed as delta files in HDFS
    • Delta files are periodically collapsed into snapshots to improve recovery
    Drawbacks:
    • The quantity of state that can be maintained is limited by the heap size of the executors.
    • State expiration by watermark and/or timeouts requires full scans over all the data.
    [Diagram: driver plus executors in a Spark cluster, each executor holding a key-value hash map and writing delta files to HDFS.]
  • 39. RocksDB State Store In Development RocksDB state store (default in Databricks) • RocksDB can serve data from the disk with a configurable amount of non-JVM memory. • Sorting keys using the appropriate column should avoid full scans to find the to-be-dropped keys.
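    A hedged sketch of opting into the RocksDB provider once it lands (class name as in the in-development work; subject to change), set before starting the query:

      spark.conf.set(
          "spark.sql.streaming.stateStore.providerClass",
          "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")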
  • 40. Project Zen: Python Usability
  • 42. Add the type hints [PEP 484] to PySpark! (Apache Spark 3.1)
    § Simple to use
      ▪ Autocompletion in IDE/Notebook. For example, in a Databricks notebook:
        ▪ Display Python docstring hints by pressing Shift+Tab
        ▪ Display a list of valid completions by pressing Tab
  • 43. Add the type hints [PEP 484] to PySpark! § Simple to use ▪ Autocompletion in IDE/Notebook § Richer API documentation ▪ Automatic inclusion of input and output types § Better code quality ▪ Static analysis in IDE, error detection, type mismatch, etc. ▪ Running mypy on CI systems. Apache Spark 3.1
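    The same hints also drive the pandas-UDF style introduced in Spark 3.0, where input and output types are inferred from the annotations; a minimal runnable example:

      import pandas as pd
      from pyspark.sql.functions import pandas_udf

      @pandas_udf("long")
      def add_one(s: pd.Series) -> pd.Series:   # UDF type is inferred from the hints
          return s + 1

      spark.range(3).select(add_one("id")).show()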
  • 45. Python Dependency Management (Apache Spark 3.1) — Supported environments:
    • Conda
    • Venv (Virtualenv)
    • PEX
    [Diagram: the driver distributes the packed environment tarball to each worker.]
    Blog post: How to Manage Python Dependencies in PySpark — http://tinyurl.com/pysparkdep
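    A sketch of the workflow from the linked blog post (paths and environment names are illustrative): pack a Conda environment and ship it via the `spark.archives` conf added in 3.1.

      import os
      from pyspark.sql import SparkSession

      # Environment packed beforehand, e.g. `conda pack -f -o pyspark_conda_env.tar.gz`
      os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
      spark = (SparkSession.builder
          .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
          .getOrCreate())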
  • 46. Koalas — Announced April 24, 2019; a pure Python library, familiar if coming from pandas.
    § Aims at providing the pandas API on top of Spark
    § Unifies the two ecosystems with a familiar API
    § Seamless transition between small and large data
  • 47. Koalas Growth (from 0 to 84 Countries) > 2 million imports per month on Databricks ~ 3 million PyPI downloads per month
  • 49. pandas APIs (In Development) — the same program in pandas and in Koalas:
    import pandas as pd
    df = pd.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()
    df.plot.line(...)

    import databricks.koalas as ks
    df = ks.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()
    df.plot.line(...)
  • 50.–51. pandas APIs (In Development) — and the planned pyspark.pandas import alongside them:
    import pyspark.pandas as ps
    df = ps.read_csv(file)
    df['x'] = df.y * df.z
    df.describe()
    df.plot.line(...)
  • 52. Visualization and Plotting (In Development)
    import pyspark.pandas as ps
    df = ps.DataFrame({
        'a': [1, 2, 2.5, 3, 3.5, 4, 5],
        'b': [1, 2, 3, 4, 5, 6, 7],
        'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
    df.plot.hist()
  • 55. New Utility Functions for Unix Time (Apache Spark 3.1)
    timestamp_seconds: creates a timestamp from the number of seconds (can be fractional) since the UTC epoch.
      > SELECT timestamp_seconds(1230219000);
      2008-12-25 07:30:00
    unix_seconds: returns the number of seconds since 1970-01-01 00:00:00 UTC (the UTC epoch).
      > SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'));
      1
    date_from_unix_date: creates a date from the number of days since 1970-01-01.
      > SELECT date_from_unix_date(1);
      1970-01-02
    unix_date: returns the number of days since 1970-01-01.
      > SELECT unix_date(DATE('1970-01-02'));
      1
    Also: timestamp_millis, timestamp_micros, unix_millis and unix_micros.
  • 56. Unix time functions at a glance (name, version introduced, description, example):
    unix_seconds / unix_micros / unix_millis (3.1): returns the number of seconds/microseconds/milliseconds since 1970-01-01 00:00:00 UTC. Input: Timestamp => output: Long.
      > SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z')); -- 1
    to_unix_timestamp (1.6) / unix_timestamp (1.5; not recommended if the result is not for the current timestamp): returns the number of seconds since 1970-01-01 00:00:00 UTC. Input: String => output: Long.
      > SELECT to_unix_timestamp('2016-04-08', 'yyyy-MM-dd'); -- 1460098800
      > SELECT unix_timestamp(); -- returns the current timestamp: 1460041200
      > SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd'); -- 1460041200
    timestamp_seconds (3.1): creates a timestamp from the number of seconds (can be fractional) since 1970-01-01 00:00:00 UTC. Input: Long => output: Timestamp.
      > SELECT timestamp_seconds(1230219000); -- 2008-12-25 07:30:00
    from_unixtime (1.5): returns a timestamp string in the specified format, converted from the input number of seconds elapsed since 1970-01-01 00:00:00 UTC. Input: Long => output: String.
      > SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss'); -- 1969-12-31 16:00:00
  • 57. New Utility Functions for Time Zone (Apache Spark 3.1)
    current_timezone: returns the current session-local time zone.
      > SELECT current_timezone();
      Asia/Shanghai
    SET TIME ZONE: sets the local time zone of the current session.
      -- Set time zone to the system default.
      SET TIME ZONE LOCAL;
      -- Set time zone to a region-based zone ID.
      SET TIME ZONE 'America/Los_Angeles';
      -- Set time zone to a zone offset.
      SET TIME ZONE '+08:00';
    Upcoming new data type: TIMESTAMP WITHOUT TIME ZONE.
  • 58. EXPLAIN FORMATTED (Apache Spark 3.1)
    § The Spark UI uses EXPLAIN FORMATTED by default
    § New format for Adaptive Query Execution: shows the plan after cost-based optimization and the final plan after adaptive optimization
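    For illustration, the formatted plan is available from both SQL and the DataFrame API:

      # SQL form
      spark.sql("EXPLAIN FORMATTED SELECT * FROM range(10) WHERE id > 5").show(truncate=False)
      # DataFrame form (mode parameter available since 3.0)
      spark.range(10).where("id > 5").explain(mode="formatted")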
  • 59. Ignore Hints (Apache Spark 3.1) — Are the hints actually helping? Is the query FASTER? Did a hint cause an OOM?
  • 60. Ignore Hints Apache Spark 3.1 SET spark.sql.optimizer.disableHints=true
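    A minimal sketch of the conf in action; `big` and `huge` are hypothetical DataFrames:

      from pyspark.sql.functions import broadcast

      spark.conf.set("spark.sql.optimizer.disableHints", "true")
      # The broadcast hint below is now ignored; the planner picks the join strategy itself
      big.join(broadcast(huge), "key")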
  • 61. Standardize Error Messages (In Development)
    § Group exception messages in dedicated files for easy maintenance and auditing [SPARK-33539]
    § Improve error message quality
      ▪ Establish an error message guideline for developers: https://spark.apache.org/error-message-guidelines.html
    § Add language-agnostic, locale-agnostic identifiers/SQLSTATE, e.g.:
      SPK-10000 MISSING_COLUMN_ERROR: cannot resolve 'bad_column' given input columns: [a, b, c, d, e]; (SQLSTATE 42704)
  • 62. New Doc for PySpark Installation Option for PyPi Search Function in Spark Doc JAVA 17 (3.2) Documentation and Environments
  • 63. New Doc for PySpark Apache Spark 3.1 https://spark.apache.org/docs/latest/api/python/index.html
  • 64. Search Function in Spark Doc Apache Spark 3.1
  • 66. Installation Option for PyPI (Apache Spark 3.1) — The default Hadoop version is changed to Hadoop 3.2.
  • 67. Deprecations and Removals § Drop Python 2.7, 3.4 and 3.5 (SPARK-32138) § Drop R < 3.5 support (SPARK-32073) § Remove hive-1.2 distribution (SPARK-32981) In the upcoming Spark 3.2 § Deprecate support of Mesos (SPARK-35050) Apache Spark 3.1
  • 68. Execution ANSI Compliance Python Performance More Streaming Node Decommissioning Create Table Syntax Sub-expression Elimination Spark on K8S GA Runtime Error Partition Pruning Explicit Cast Char/Varchar Predicate Pushdown Shuffle Hash Join Shuffle Removal Nested Field Pruning Catalog APIs for JDBC Ignore Hints Stream- stream Join New Doc for PySpark History Server Support for SS Streaming Table APIs Python Type Support Dependency Management Installation Option for PyPi Search Function in Spark Doc State Schema Validation Stage-level Scheduling Apache Spark 3.1
  • 69. ANSI SQL Compliance Python Performance More Streaming Decorrelation Framework Timestamp w/o Time Zone Adaptive Optimization Scala 2.13 Beta Error Code Implicit Type Cast Interval Type Complex Type Support in ORC Lateral Join Compile Latency Reduction Low-latency Scheduler JAVA 17 Push-based Shuffle Session Window Visualization and Plotting RocksDB State Store Queryable State Store Pythonic Error Handling Richer Input/Output pandas APIs Parquet 1.12 (Column Index) State Store APIs DML Metrics ANSI Mode GA In Development
  • 70. Thank you for your contributions!
  • 71. Xiao Li (lixiao @ databricks) Wenchen Fan (wenchen @ databricks)