At Microsoft, we store datasets (from both internal teams and external customers) ranging from a few GBs to hundreds of PBs in our data lake. Analytics on these datasets range from traditional batch-style queries (e.g., OLAP) to exploratory "needle in a haystack" queries (e.g., point lookups and summarization).
4. Rahul Potharaju
Principal Software Engineering Manager @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace, .NET for Apache Spark
Publishes in academic conferences, e.g., VLDB
Terry Kim
Principal Software Engineer @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace, Apache Spark, .NET for Apache Spark
5. We work on everything Spark
Offer Spark-as-a-Service to Microsoft customers
Contribute back to Apache Spark
We open source our work!
8. In databases, an ‘index’ is a data structure that
improves the speed of data retrieval operations on
a database table at the cost of additional writes and
storage space to maintain the index data structure.
Index from the back of a textbook
N
Namespace 493, 533, 544
Nested-loop join 718-722
Normalization 67, 85-92
Null value 33-35, 168, 252
See also Not-null constraint
O
Optimization
See Plan selection, Query optimization
ORDER BY 255-256, 461
Ordering 461-463, 541-543
See also Join ordering, Sorting
R
Random walker 1147, 1154
Range query 639-640, 662-664
Read committed 304-305
READ 849
Read lock
See Shared lock
Read uncommitted 304
Relational calculus 241
Relational database system 3
S
Semijoin 243
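The trade-off in the definition above (faster retrieval in exchange for extra writes and storage) can be sketched in a few lines of Python. This is a toy illustration only, not Hyperspace code: the table, the column names, and every function name here are hypothetical.

```python
# Hypothetical toy table: a list of (color, value) rows.
rows = [("Red", 1), ("Blue", 2), ("Red", 3), ("Green", 4)]


def scan_lookup(table, color):
    """Without an index: inspect every row (linear in table size)."""
    return [value for c, value in table if c == color]


def build_index(table):
    """Build an index from color to row positions (this is the extra storage)."""
    index = {}
    for position, (color, _) in enumerate(table):
        index.setdefault(color, []).append(position)
    return index


def insert_row(table, index, row):
    """Every write must also update the index (this is the extra write cost)."""
    table.append(row)
    index.setdefault(row[0], []).append(len(table) - 1)


def index_lookup(table, index, color):
    """With an index: jump straight to the matching row positions."""
    return [table[p][1] for p in index.get(color, [])]


color_index = build_index(rows)
insert_row(rows, color_index, ("Red", 5))
print(index_lookup(rows, color_index, "Red"))
```

Both lookups return the same rows; the index only changes how much data has to be touched to find them.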
10. Goals of Hyperspace Indexing
Agnostic to Data Format
Should index data in the lake in any format, including tabular and semi-structured formats (e.g., CSV, JSON, Parquet, ORC, Avro) and binary data (e.g., video, audio, images)
Low-cost Index Metadata Management
Should store all metadata on the data lake and should not depend on any other service to operate correctly
Multi-engine Interoperability
Should make third-party engine integration (e.g., non-Spark systems) feasible, intuitive, and easy, e.g., build an index through Spark and leverage it through Synapse SQL
Extensible Indexing Infrastructure
Should offer mechanisms for easily plugging in newer auxiliary data structures (related to indexing)
Security, Privacy & Compliance
Should meet the necessary security, privacy, and compliance standards, since auxiliary structures copy the original dataset either partly or in full
11. Vision of Hyperspace Indexing
Data Lake
Datasets: structured (e.g., Parquet) and unstructured (e.g., CSV, TSV)
Index: non-clustered (columnar covering index, chunk-elimination, statistics, views)
Indexing Infrastructure
Index Creation & Maintenance API: primitives for index lifecycle management (e.g., creating, refreshing, deleting), enforcing retention, purge, etc.
Log Management API: change log for enabling engine interoperability
Index Specifications: layouts for enabling engine interoperability
Concurrency Model: primitives for optimistic concurrency
Query Infrastructure
Optimizer Extensions: making the optimizer cost- and index-aware; algorithms for index selection
Index Recommendation: allows index suggestions for a query or workload
What-If & Why-Not: allows index cost-benefit analysis and explainability
User-facing Index Management APIs
Allows interaction with the indexing ecosystem
12. Hyperspace’s Usage API in Spark
Usage
// Index Maintenance
createIndex(df: DataFrame, indexCfg: IndexConfig): Unit
deleteIndex(indexName: String): Unit
restoreIndex(indexName: String): Unit
vacuumIndex(indexName: String): Unit
rebuildIndex(indexName: String): Unit
cancel(indexName: String): Unit
Smarts
// Debugging and Index Recommendation
explain(df: DataFrame): Unit
whatIf(workload: Array[DataFrame], indexCfg: IndexConfig): Cost
recommend(workload: Array[DataFrame], options: RecOptions): Recommendation
Customization
// Configuration for Storage and Query Optimizer
hyperspace.system.path
hyperspace.index.creation.[path | namespace]
hyperspace.index.search.[path | namespace]
hyperspace.index.search.disablePublicIndexes
Language Choices: Scala, Python, .NET
13. … and btw, the indexes live on the data lake!
Filesystem Root
  /indexes/<scope = public | user | namespace>
    <index name>
      _hyperspace_log
        create (active)
        refresh
        active
        …
      <index-directory-1>
      <index-directory-2>
      <index-directory-3>
  /path/to/data/1
    data files
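Because the operation log lives next to the index on the lake, one way to get optimistic concurrency is to let writers race to create the next numbered log entry, with only the first create succeeding. The slides do not describe the actual mechanism, so the sketch below is an assumption throughout: the numbered-file layout, the JSON entry format, and the function names are hypothetical, and it relies on atomic create-if-absent on a local filesystem rather than on any data lake API.

```python
import json
import os
import tempfile


def latest_entry_id(log_dir):
    """Return the highest numbered log entry, or -1 if the log is empty."""
    ids = [int(name) for name in os.listdir(log_dir) if name.isdigit()]
    return max(ids, default=-1)


def try_commit(log_dir, expected_latest, state):
    """Write entry expected_latest + 1 only if no other writer already has.

    Returns True on success and False if the writer lost the race.
    """
    path = os.path.join(log_dir, str(expected_latest + 1))
    try:
        # O_EXCL makes the create fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump({"state": state}, f)
    return True


# Stand-in for an index's _hyperspace_log directory (hypothetical location).
log_dir = tempfile.mkdtemp(prefix="hyperspace_log_sketch_")

# Two writers read the same log state, then both try to commit.
seen_by_a = latest_entry_id(log_dir)
seen_by_b = latest_entry_id(log_dir)
a_committed = try_commit(log_dir, seen_by_a, "create")
b_committed = try_commit(log_dir, seen_by_b, "refresh")
print(a_committed, b_committed)
```

The losing writer would re-read the log and decide whether to retry, which is the usual optimistic pattern: no locks are held, and conflicts are detected at commit time.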
15. Azure Synapse Analytics offers the best Hyperspace indexing experience yet!
• No additional JARs to include
• Fastest access to latest features
• Support for Scala | Python | .NET
• Seamless integration with the UI
• Meta-store integration
• Notebooks for faster iterations
17. Our first hyperspace: the covering index
Creates a “copy” of the original data in a different sort order. During optimization, reads from
index instead of base table. Useful for eliminating shuffles and filtering predicates.
Without an index (table columns: a, b, c)
SELECT b WHERE a = ‘Red’
Full scan (linear time)
With a covering index (Index ON a, Include b; index columns: a, b)
SELECT b WHERE a = ‘Red’
Binary search (log time)
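The difference between the full scan and the binary search can be sketched with Python's standard `bisect` module. This is a single-machine toy, not how Hyperspace stores or reads an index; the table contents and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical base table with columns (a, b, c), in no particular order.
table = [("Red", 10, "x"), ("Blue", 20, "y"), ("Green", 30, "z"), ("Red", 40, "w")]


def full_scan(table, key):
    """SELECT b WHERE a = key, with no index: touch every row."""
    return [b for a, b, _ in table if a == key]


def build_covering_index(table):
    """Index ON a, INCLUDE b: a copy of columns (a, b), sorted by a."""
    pairs = sorted((a, b) for a, b, _ in table)
    keys = [a for a, _ in pairs]
    included = [b for _, b in pairs]
    return keys, included


def index_lookup(index, key):
    """SELECT b WHERE a = key, via binary search over the sorted keys."""
    keys, included = index
    lo = bisect_left(keys, key)
    hi = bisect_right(keys, key)
    return included[lo:hi]


covering_index = build_covering_index(table)
print(index_lookup(covering_index, "Red"))
```

The index answers the query without touching column c or the base table at all, which is what makes it "covering" for this query.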
18. Our first hyperspace: the covering index
Table A (columns a, b, c), Table B (columns a, p, q)
SELECT b, c FROM A JOIN B ON A.a = B.a
Without indexes
Step 1: Shuffle (data is not sorted)
Step 2: Sort both sides
Step 3: Merge
With covering indexes
Step 1: Optimizer picks the indexes Idx A and Idx B (pre-shuffled, pre-sorted)
Step 2: Merge
Shuffle eliminated: since shuffle is the most expensive step, this query might run faster at scale
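The merge step of the join can be sketched as follows. A single Python process cannot show the shuffle itself, so this toy only illustrates why pre-sorted inputs let the query go straight to the merge; the tables and names are hypothetical and this is not Spark's implementation.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) rows that are already sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    result.append((lkey, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result


# Hypothetical unsorted base tables, keyed on column a.
table_a = [(3, "b3"), (1, "b1"), (2, "b2")]
table_b = [(2, "p2"), (3, "p3"), (4, "p4")]

# Without indexes: both sides must be sorted at query time.
without_index = merge_join(sorted(table_a), sorted(table_b))

# With covering indexes: the sorted copies are built once, ahead of time,
# and every later query starts directly at the merge.
idx_a, idx_b = sorted(table_a), sorted(table_b)
with_index = merge_join(idx_a, idx_b)
print(with_index)
```

Both paths produce the same result; the index moves the sorting (and, in a distributed engine, the shuffle) from query time to index build time.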
22. Preliminary Performance Evaluation of Hyperspace Covering Indexes
Hyperspace acceleration: 2x on TPC-H, 1.8x on TPC-DS
Up to 11x query performance improvement
Workloads derived from TPC Benchmark™ H/DS (Scale Factor = 1000, Apache Spark 2.4, Parquet data)
Compute Configuration:
• VM Instance = Azure E8 V3
• Workers/Executors = 7
• Cores per executor = 8
• Executor memory = 47 GB
• Autoscale disabled
• Storage = ADLS Gen2
23. Open Sourcing Hyperspace v0.1
New extensible indexing subsystem for Apache Spark
Simply add on; no core changes needed
Same technology that powers the indexing engine inside Azure Synapse Analytics
Works out of the box with open source Apache Spark
Scala, Python, and .NET support
Accelerated performance on key workloads
https://github.com/microsoft/hyperspace or https://aka.ms/hyperspace
25. Let us build Hyperspace together!
Metadata & Lifecycle
Multi-engine interop, concurrency, support for views & stats
Indexing enhancements
Incremental indexing, index optimization, support for Delta Lake
Optimizer enhancements
More robust index & view selection, explainability
Documentation & Tutorials
Best practices, gotchas, more experiments
More index types
Critique existing design, new designs… more on this in the next slide
Index Recommendation
Single-query & multi-query workload-based recommendation
26. What type of hyperspaces can we build together?
In Hyperspace, “index” is used broadly to refer to a derived dataset, i.e., some auxiliary information about the underlying data that will aid in query acceleration
COVERING INDEX
Creates a “copy” of the original data in a different sort order. During
optimization, reads from index instead of base table. Useful for eliminating
shuffles and filtering predicates.
CHUNK-ELIMINATION INDEX
Creates a “pointer” from a search key back to the original data. During
optimization, performs a first lookup to obtain the pointer. Useful for
finding-needle-in-the-haystack queries.
MATERIALIZED VIEWS
Executes a (potentially complex) query, stores the results. During
optimization, entire subtrees can be rewritten. Useful when the same result
is computed several times.
STATISTICS
Collects statistics about the underlying dataset. During optimization, can
power a cost-based optimizer. Useful for join re-ordering, index/view
selection etc.
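To make the chunk-elimination idea concrete, here is a toy sketch in which the "pointer" is simply the set of file names that contain a given search key. The file names, row layout, and functions are hypothetical; the slides do not specify how such an index would be laid out in Hyperspace.

```python
# Hypothetical data files, each holding rows of (user_id, event).
files = {
    "part-0": [(1, "login"), (2, "click")],
    "part-1": [(3, "login"), (4, "purchase")],
    "part-2": [(2, "logout"), (5, "click")],
}


def build_chunk_index(files):
    """Map each search key to the set of files that contain it."""
    index = {}
    for name, rows in files.items():
        for key, _ in rows:
            index.setdefault(key, set()).add(name)
    return index


def lookup(files, index, key):
    """First look up the pointer, then scan only the files it names."""
    candidates = sorted(index.get(key, ()))
    matches = [event for name in candidates for k, event in files[name] if k == key]
    return candidates, matches


chunk_index = build_chunk_index(files)
scanned, events = lookup(files, chunk_index, 2)
print(scanned, events)
```

Unlike a covering index, this structure does not copy the data itself; it only narrows down where to look, which is why it suits needle-in-a-haystack queries over large datasets.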
27. Conclusion: Open Sourcing Hyperspace v0.1
New extensible indexing subsystem for Apache Spark
Simply add on; no core changes needed
Same technology that powers the indexing engine inside Azure Synapse Analytics
Works out of the box with open source Apache Spark
Scala, Python, and .NET support
Accelerated performance on key workloads
Hyperspace acceleration: 2x on TPC-H, 1.8x on TPC-DS (Scale Factor = 1000, Apache Spark 2.4, Parquet data)
Up to 10x query performance improvement
Open sourced today: https://github.com/microsoft/hyperspace
It is not perfect… but that’s where we need your guidance!