Hyperspace: An Indexing Subsystem for Apache Spark

Rahul Potharaju & Terry Kim
Microsoft

Rahul Potharaju
Principal Software Engineering Manager @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace, .NET for Apache Spark
Publishes in academic conferences, e.g., VLDB
Terry Kim
Principal Software Engineer @Microsoft
Part of the Spark team at Microsoft
OSS: Hyperspace, Apache Spark, .NET for Apache Spark

We work on everything Spark
Offer Spark-as-a-Service to Microsoft customers
Contribute back to Apache Spark
We open source our work!

Agenda
Rahul Potharaju
Background, Vision, Concepts,
Call-for-Action, Conclusion
Terry Kim
Demo, Performance Deep-dive

In databases, an ‘index’ is a data structure that
improves the speed of data retrieval operations on
a database table at the cost of additional writes and
storage space to maintain the index data structure.
Index from the back of a textbook
N
Namespace 493, 533, 544
Nested-loop join 718-722
Normalization 67, 85-92
Null value 33-35, 168, 252
See also Not-null constraint
O
Optimization
See Plan selection, Query optimization
ORDER BY 255-256, 461
Ordering 461-463, 541-543
See also Join ordering, Sorting
R
Random walker 1147, 1154
Range query 639-640, 662-664
Read committed 304-305
READ 849
Read lock
See Shared lock
Read uncommitted 304
Relational calculus 241
Relational database system 3
S
Semijoin 243

Goals of Hyperspace Indexing
Agnostic to Data Format: Should index data in the lake in any format, including text and structured file formats (e.g., CSV, JSON, Parquet, ORC, Avro) and binary data (e.g., video, audio, images).
Low-cost Index Metadata Management: Should store all metadata on the data lake and should not depend on any other service to operate correctly.
Multi-engine Interoperability: Should make third-party engine integration (e.g., non-Spark systems) feasible, intuitive and easy: build the index through Spark and leverage it through Synapse SQL.
Extensible Indexing Infrastructure: Should offer mechanisms for easily plugging in newer auxiliary data structures (related to indexing).
Security, Privacy & Compliance: Should meet the necessary security, privacy, and compliance standards, since auxiliary structures copy the original dataset either partly or in full.

Vision of Hyperspace Indexing
User-facing Index Management APIs: allows interaction with the indexing ecosystem
Query Infrastructure
Optimizer Extensions: making the optimizer cost- and index-aware, algorithms for index selection
Index Recommendation: allows index suggestions for a query/workload
What-If & Why-not: allows index cost-benefit analysis & explainability
Indexing Infrastructure
Index Creation & Maintenance API: primitives for index lifecycle management (e.g., creating, refreshing, deleting), enforcing retention, purge, etc.
Log Management API: change log for enabling engine interoperability
Index Specifications: layouts for enabling engine interoperability
Concurrency Model: primitives for optimistic concurrency
Data Lake
Datasets: structured (e.g., parquet) and unstructured (e.g., csv, tsv)
Index: non-clustered (columnar covering index, chunk-elimination, statistics, views)

Hyperspace’s Usage API in Spark
Usage
// Index Maintenance
createIndex(df: DataFrame, indexCfg: IndexConfig): Unit
deleteIndex(indexName: String): Unit
restoreIndex(indexName: String): Unit
vacuumIndex(indexName: String): Unit
rebuildIndex(indexName: String): Unit
cancel(indexName: String): Unit
Smarts
// Debugging and Index Recommendation
explain(df: DataFrame): Unit
whatIf(workload: Array[DataFrame], indexCfg: IndexConfig): Cost
recommend(workload: Array[DataFrame], options: RecOptions): Recommendation
Customization
// Configuration for Storage and Query Optimizer
hyperspace.system.path
hyperspace.index.creation.[path | namespace]
hyperspace.index.search.[path | namespace]
hyperspace.index.search.disablePublicIndexes
Language Choices: Scala, Python, .NET
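The maintenance calls on this slide imply an index lifecycle. The slide gives only the signatures, so the sketch below rests on an assumption: that deleteIndex is a soft delete, that restoreIndex undoes it, and that vacuumIndex makes it permanent. The state names and the `apply_operation` helper are hypothetical and are not part of the Hyperspace API.

```python
# Hypothetical lifecycle states (names are illustrative, not taken from the slides).
ACTIVE = "ACTIVE"
DELETED = "DELETED"            # soft-deleted: index data is still on the lake
DOES_NOT_EXIST = "DOESNOTEXIST"  # vacuumed: index data is physically removed

# Assumed transitions: (operation, current state) -> next state.
TRANSITIONS = {
    ("deleteIndex", ACTIVE): DELETED,
    ("restoreIndex", DELETED): ACTIVE,
    ("vacuumIndex", DELETED): DOES_NOT_EXIST,
    ("rebuildIndex", ACTIVE): ACTIVE,
}


def apply_operation(operation, state):
    """Return the next lifecycle state, or raise if the operation is not allowed."""
    try:
        return TRANSITIONS[(operation, state)]
    except KeyError:
        raise ValueError(f"{operation} is not allowed while the index is {state}")


state = apply_operation("deleteIndex", ACTIVE)   # soft delete
state = apply_operation("restoreIndex", state)   # back to a usable index
```

Modelling the lifecycle as explicit transitions makes it easy to reject invalid sequences, such as vacuuming an index that is still active.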

… and btw, the indexes live on the data lake!
Filesystem Root
  /indexes/<scope = public | user | namespace>
    <index name>
      _hyperspace_log
        create (active)
        refresh
        active
        …
      <index-directory-1>
      <index-directory-2>
      <index-directory-3>
  /path/to/data/1
    data files
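Keeping an operation log next to the index is what lets several writers coordinate without a separate service. Below is a minimal sketch of optimistic concurrency over such a log. It assumes that log entries are numbered files and that the storage layer offers an atomic "create only if absent" primitive, and it uses a local temporary directory as a stand-in for the lake. The file naming, the entry fields and the helpers `latest_id` and `try_commit` are hypothetical, not Hyperspace's actual log format.

```python
import json
import os
import tempfile


def latest_id(log_dir):
    """Return the highest log entry id in the directory, or -1 if the log is empty."""
    ids = [int(name) for name in os.listdir(log_dir) if name.isdigit()]
    return max(ids, default=-1)


def try_commit(log_dir, base_id, entry):
    """Write `entry` as log record base_id + 1.

    Returns False if that record already exists, meaning another writer
    committed first and the caller's view of the log is stale.
    """
    path = os.path.join(log_dir, str(base_id + 1))
    try:
        # O_EXCL makes creation fail if the file is already there.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump(entry, f)
    return True


log_dir = tempfile.mkdtemp(prefix="_hyperspace_log_")
base = latest_id(log_dir)

# Two writers both read the same base id; only the first commit succeeds.
first = try_commit(log_dir, base, {"op": "create", "state": "active"})
second = try_commit(log_dir, base, {"op": "refresh", "state": "active"})

# The losing writer re-reads the log and retries on top of the new state.
retry = try_commit(log_dir, latest_id(log_dir), {"op": "refresh", "state": "active"})
```

No lock is held while an operation runs; conflicts are detected only at commit time, which suits workloads where concurrent index modifications are rare.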

… and index-on-the-lake provides several benefits!
Index scan scales
Open format index
Serverless access protocol

Azure Synapse Analytics offers the best Hyperspace indexing experience yet!
• No additional JAR includes
• Fastest access to latest features
• Support for Scala | Python | .NET
• Seamless integration with the UI
• Meta-store integration
• Notebooks for faster iterations

Demo: Hello Hyperspace!
Notebook: https://aka.ms/hellohyperspace

Our first hyperspace: the covering index
Creates a “copy” of the original data in a different sort order. During optimization, reads from
index instead of base table. Useful for eliminating shuffles and filtering predicates.
Without index
Table columns: a, b, c
SELECT b WHERE a = 'Red'
Full scan (linear time)
With covering index
Covering Index: Index ON a, Include b
Index columns: a, b
SELECT b WHERE a = 'Red'
Binary search (log time)
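The difference between the two access paths can be sketched in a few lines. This is an in-memory analogy for the slide, not Hyperspace's on-disk implementation. The sample rows and the helper names are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical base table with columns (a, b, c), stored in arbitrary order.
table = [
    ("Red", 1, "x"),
    ("Blue", 2, "y"),
    ("Green", 3, "z"),
    ("Red", 4, "w"),
]


def full_scan(rows, color):
    """Linear time: every row of the base table is inspected."""
    return [b for (a, b, _c) in rows if a == color]


def build_covering_index(rows):
    """Copy only the indexed column a and the included column b, sorted by a."""
    pairs = sorted((a, b) for (a, b, _c) in rows)
    keys = [a for (a, _b) in pairs]
    included = [b for (_a, b) in pairs]
    return keys, included


def index_lookup(index, color):
    """Logarithmic time to find the matching key range; b is read from the index itself."""
    keys, included = index
    lo = bisect_left(keys, color)
    hi = bisect_right(keys, color)
    return included[lo:hi]


index = build_covering_index(table)
scan_result = full_scan(table, "Red")
index_result = index_lookup(index, "Red")
```

Because column b is included in the index, the lookup never touches the base table; column c is not copied, so a query that needs c could not be answered from this index alone.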

Our first hyperspace: the covering index
Query: SELECT b, c FROM A JOIN B ON A.a = B.a
Table A columns: a, b, c
Table B columns: a, p, q
Without Indexes
Step 1: Shuffle Table A and Table B (data is not sorted)
Step 2: Sort both sides
Step 3: Merge into the result
With Covering Indexes
Step 1: Optimizer picks the indexes Idx A and Idx B (pre-shuffled, pre-sorted)
Step 2: Merge into the result
Shuffle eliminated: since shuffle is the most expensive step, this query might run faster at scale.
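The two plans can be compared in miniature. In this single-process sketch, `sorted()` stands in for the shuffle and sort steps, and lists already ordered by the join key stand in for the covering indexes. The rows and function names are illustrative assumptions.

```python
def merge_join(left, right):
    """Join two row lists that are already sorted by their first field (the join key)."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for r in right[j:j_end]:
                    out.append(left[i] + r[1:])
                i += 1
            j = j_end
    return out


def sort_merge_join(left, right):
    """Without indexes: both sides must be ordered before the merge can start."""
    return merge_join(sorted(left), sorted(right))


# Unordered base tables A(a, b, c) and B(a, p, q).
table_a = [(2, "b2", "c2"), (1, "b1", "c1"), (2, "b3", "c3")]
table_b = [(3, "p2", "q2"), (2, "p1", "q1")]

# "Indexes": the same rows, ordered by the join key ahead of query time.
idx_a = sorted(table_a)
idx_b = sorted(table_b)

without_indexes = sort_merge_join(table_a, table_b)
with_indexes = merge_join(idx_a, idx_b)
```

Both plans return the same rows; the indexed plan simply skips the ordering work at query time because it was paid for once when the index was built.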

Demo: Deep-dive into Hyperspace’s Index-based Query Optimization

Preliminary Performance Evaluation of Hyperspace Covering Indexes
Compute Configuration:
• VM Instance = Azure E8 V3
• Workers/Executors = 7
• Cores per executor = 8
• Executor memory = 47 GB
• Autoscale disabled
• ADLS Gen2
[Figure: Workload derived from TPC Benchmark™ H (TPC-H), Scale Factor = 1000, Apache Spark 2.4, Parquet data. Baseline and Hyperspace duration (seconds) and gain per query, for queries 1-15 and 17-22. No regressions, up to 9x gains.]
[Figure: Workload derived from TPC Benchmark™ DS (TPC-DS), Top 20. Baseline and Hyperspace duration (seconds) and gain per query, for queries 4, 6, 11, 17, 25, 29, 37, 50, 54, 64, 78, 80, 82, 93, 14a, 14b, 23a, 23b, 24a, 24b. No regressions, up to 11x gains.]

Hyperspace acceleration (workloads derived from TPC Benchmark™ H/DS)
TPC-H: 2x
TPC-DS: 1.8x
Up to 11x query performance improvement

Open Sourcing Hyperspace v0.1
New extensible indexing subsystem for Apache Spark
Simply add on; no core changes needed
Same technology that powers the indexing engine inside Azure Synapse Analytics
Works out-of-box with open source Apache Spark
Scala, Python, and .NET support
Accelerated performance on key workloads
https://github.com/microsoft/hyperspace
OR
https://aka.ms/hyperspace

Thanks to everyone who is making this possible…

Let us build Hyperspace together!
Meta-data & Lifecycle: Multi-engine interop, concurrency, support for views & stats
Indexing enhancements: Incremental indexing, index optimization, support for Delta Lake
Optimizer enhancements: More robust index & view selection, explainability
Documentation & Tutorials: Best practices, gotchas, more experiments
More index types: Critique existing design, new designs… more on this in next slide
Index Recommendation: Single query & multi-query workload-based recommendation

What type of hyperspaces can we build together?
In Hyperspace, “index” is used broadly to refer to a derived dataset, i.e., some auxiliary information about the underlying data that will aid in query acceleration.
COVERING INDEX
Creates a “copy” of the original data in a different sort order. During optimization, reads from index instead of base table. Useful for eliminating shuffles and filtering predicates.
CHUNK-ELIMINATION INDEX
Creates a “pointer” from a search key back to the original data. During optimization, performs a first lookup to obtain the pointer. Useful for finding-needle-in-the-haystack queries.
MATERIALIZED VIEWS
Executes a (potentially complex) query, stores the results. During optimization, entire subtrees can be rewritten. Useful when the same result is computed several times.
STATISTICS
Collects statistics about the underlying dataset. During optimization, can power a cost-based optimizer. Useful for join re-ordering, index/view selection, etc.
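As a rough illustration of the chunk-elimination idea, the sketch below maps each search key to the data files that contain it, so a point lookup opens only those files. The file names, the in-memory "files" and the helper functions are hypothetical; the slides do not specify the real index layout.

```python
from collections import defaultdict

# Hypothetical data files on the lake: path -> rows of (key, payload).
files = {
    "part-0": [("k1", "a"), ("k2", "b")],
    "part-1": [("k3", "c")],
    "part-2": [("k2", "d"), ("k4", "e")],
}


def build_pointer_index(data_files):
    """Map each search key to the set of files in which it occurs."""
    index = defaultdict(set)
    for path, rows in data_files.items():
        for key, _payload in rows:
            index[key].add(path)
    return index


def lookup(data_files, index, key):
    """First consult the index for pointers, then read only the files it names."""
    candidates = sorted(index.get(key, ()))
    rows = [row for path in candidates for row in data_files[path] if row[0] == key]
    return candidates, rows


pointer_index = build_pointer_index(files)
candidates, rows = lookup(files, pointer_index, "k2")
```

Unlike a covering index, nothing but the key and the pointer is copied, so the index stays small while the base data is still read, only far less of it.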

Conclusion: Open Sourcing Hyperspace v0.1
New extensible indexing subsystem for Apache Spark
Simply add on; no core changes needed
Same technology that powers the indexing engine inside Azure Synapse Analytics
Works out-of-box with open source Apache Spark
Scala, Python, and .NET support
Accelerated performance on key workloads
Hyperspace acceleration: 2x on TPC-H, 1.8x on TPC-DS
Up to 11x query performance improvement
https://github.com/microsoft/hyperspace
Open sourced today
It is not perfect… but that’s where we need your guidance!

