Hyperspace: An Indexing Subsystem for Apache Spark
Rahul Potharaju & Terry Kim
Microsoft
Who?
Rahul Potharaju
Principal Software Engineering Manager @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace, .NET for Apache Spark
Publishes in academic conferences, e.g., VLDB
Terry Kim
Principal Software Engineer @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace, Apache Spark, .NET for Apache Spark
We work on everything Spark
Offer Spark-as-a-Service to Microsoft customers
Contribute back to Apache Spark
We open source our work!
Agenda
Rahul Potharaju: Background, Vision, Concepts, Call-for-Action, Conclusion
Terry Kim: Demo, Performance Deep-dive
What is an index!?
In databases, an ‘index’ is a data structure that
improves the speed of data retrieval operations on
a database table at the cost of additional writes and
storage space to maintain the index data structure.
Index from the back of a textbook
N
Namespace 493, 533, 544
Nested-loop join 718-722
Normalization 67, 85-92
Null value 33-35, 168, 252
See also Not-null constraint
O
Optimization
See Plan selection, Query optimization
ORDER BY 255-256, 461
Ordering 461-463, 541-543
See also Join ordering, Sorting
R
Random walker 1147, 1154
Range query 639-640, 662-664
Read committed 304-305
READ 849
Read lock
See Shared lock
Read uncommitted 304
Relational calculus 241
Relational database system 3
S
Semijoin 243
Overview of Hyperspace
Goals of Hyperspace Indexing
• Agnostic to Data Format: Should index data in the lake in any format, including file formats such as CSV, JSON, Parquet, ORC, and Avro, as well as binary media (e.g., video, audio, images).
• Low-cost Index Metadata Management: Should store all metadata on the data lake and should not depend on any other service to operate correctly.
• Multi-engine Interoperability: Should make third-party engine integration (e.g., non-Spark systems) feasible, intuitive, and easy: build an index through Spark and leverage it through Synapse SQL.
• Extensible Indexing Infrastructure: Should offer mechanisms for easy pluggability of newer auxiliary data structures (related to indexing).
• Security, Privacy & Compliance: Should meet the necessary security, privacy, and compliance standards, as auxiliary structures copy the original dataset either partly or in full.
Vision of Hyperspace Indexing
User-facing Index Management APIs: allow interaction with the indexing ecosystem
Query Infrastructure
• Optimizer Extensions: making the optimizer cost- and index-aware; algorithms for index selection
• Index Recommendation: allows index suggestions for a query or workload
• What-If & Why-not: allows index cost-benefit analysis and explainability
Indexing Infrastructure
• Index Creation & Maintenance API: primitives for index lifecycle management (e.g., creating, refreshing, deleting), enforcing retention, purging, etc.
• Log Management API: change log for enabling engine interoperability
• Index Specifications: layouts for enabling engine interoperability
• Concurrency Model: primitives for optimistic concurrency
Data Lake
• Datasets: structured (e.g., Parquet) and unstructured (e.g., CSV, TSV)
• Index: non-clustered (columnar covering index, chunk-elimination, statistics, views)
Hyperspace’s Usage API in Spark
Usage
// Index Maintenance
createIndex(df: DataFrame, indexCfg: IndexConfig): Unit
deleteIndex(indexName: String): Unit
restoreIndex(indexName: String): Unit
vacuumIndex(indexName: String): Unit
rebuildIndex(indexName: String): Unit
cancel(indexName: String): Unit
Smarts
// Debugging and Index Recommendation
explain(df: DataFrame): Unit
whatIf(workload: Array[DataFrame], indexCfg: IndexConfig): Cost
recommend(workload: Array[DataFrame], options: RecOptions): Recommendation
Customization
// Configuration for Storage and Query Optimizer
hyperspace.system.path
hyperspace.index.creation.[path | namespace]
hyperspace.index.search.[path | namespace]
hyperspace.index.search.disablePublicIndexes
Language Choices
Scala
Python
.NET
… and btw, the indexes live on the data lake!
Filesystem Root
  /indexes/<scope = public | user | namespace>
    <index name>
      _hyperspace_log
        create (active)
        refresh
        active
        …
      <index-directory-1>
      <index-directory-2>
      <index-directory-3>
  /path/to/data/1
    data files
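The layout above keeps an operation log (`_hyperspace_log`) next to the index data, and the vision slide lists primitives for optimistic concurrency. One way such a log could behave is sketched below in Python. This is an illustration under assumptions, not Hyperspace's implementation: the numbered entry files, the state names (`CREATING`, `ACTIVE`, `REFRESHING`), and the helper functions are all hypothetical.

```python
import os
import tempfile


def try_append(log_dir, entry_id, state):
    """Create log entry `entry_id`; return False if another writer got there first."""
    path = os.path.join(log_dir, str(entry_id))
    try:
        # O_EXCL makes creation fail when the entry already exists,
        # which is the optimistic-concurrency check in this sketch.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(state)
    return True


def latest_state(log_dir):
    """Return the state recorded in the highest-numbered log entry, if any."""
    ids = sorted(int(name) for name in os.listdir(log_dir))
    if not ids:
        return None
    with open(os.path.join(log_dir, str(ids[-1]))) as f:
        return f.read()


# Hypothetical log directory standing in for <index name>/_hyperspace_log.
log_dir = tempfile.mkdtemp(prefix="_hyperspace_log_")
first = try_append(log_dir, 0, "CREATING")
second = try_append(log_dir, 1, "ACTIVE")
conflict = try_append(log_dir, 1, "REFRESHING")  # a second writer racing for entry 1
```

Because all state lives in files on the lake, any engine that can list and read the directory can determine the current index state without a separate metadata service.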
… and index-on-the-lake provides several benefits!
• Index scan scales
• Open format index
• Serverless access protocol
Azure Synapse Analytics offers the best Hyperspace indexing experience yet!
• No additional JAR includes
• Fastest access to latest features
• Support for Scala | Python | .NET
• Seamless integration with the UI
• Meta-store integration
• Notebooks for faster iterations
Demo: Hello Hyperspace!
Notebook: https://aka.ms/hellohyperspace
Our first hyperspace: the covering index
Creates a “copy” of the original data in a different sort order. During optimization, reads from
index instead of base table. Useful for eliminating shuffles and filtering predicates.
Without an index: Table (a, b, c)
SELECT b WHERE a = 'Red'
Full scan (linear time)
With a covering index: Index ON a, Include b, stored as (a, b)
SELECT b WHERE a = 'Red'
Binary search (log time)
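The difference between the two access paths can be sketched in a few lines of Python. This is a toy, in-memory illustration assuming made-up rows and hypothetical helper names. It shows the idea of a sorted copy of columns (a, b), not how Hyperspace stores or reads its indexes.

```python
from bisect import bisect_left, bisect_right

# Toy base table with columns (a, b, c); the values are invented for illustration.
table = [
    ("Red", 1, "x"),
    ("Blue", 2, "y"),
    ("Green", 3, "z"),
    ("Red", 4, "w"),
]


def full_scan(rows, color):
    """SELECT b WHERE a = color by touching every row (linear time)."""
    return [b for (a, b, _c) in rows if a == color]


def build_covering_index(rows):
    """Copy only columns (a, b), sorted by the indexed column a."""
    pairs = sorted((a, b) for (a, b, _c) in rows)
    keys = [a for a, _b in pairs]
    included = [b for _a, b in pairs]
    return keys, included


def index_lookup(index, color):
    """Binary-search the sorted keys, then answer from the index alone."""
    keys, included = index
    lo = bisect_left(keys, color)
    hi = bisect_right(keys, color)
    return included[lo:hi]


index = build_covering_index(table)
scan_result = full_scan(table, "Red")
index_result = index_lookup(index, "Red")
```

Both paths return the same values of b; the index path never reads column c and locates the matching range with two binary searches.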
Query: SELECT b, c FROM A JOIN B ON A.a = B.a
Table A has columns (a, b, c); Table B has columns (a, p, q).
Without Indexes
Step 1: Shuffle (data is not sorted)
Step 2: Sort both sides
Step 3: Merge
With Covering Indexes
Step 1: Optimizer picks index (pre-shuffled, pre-sorted)
Step 2: Merge
Shuffle eliminated: since shuffle is the most expensive step, this query might run faster at scale.
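The two plans can be compared with a small Python sketch. It is a single-process illustration under assumptions: the tables are tiny invented lists of (key, payload) pairs, `sorted(...)` stands in for the shuffle-and-sort steps, and `merge_join` is a hypothetical helper rather than Spark's sort-merge join operator.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left-side row in the run with every right-side row.
            while i < len(left) and left[i][0] == lk:
                for r in right[j:j_end]:
                    out.append((lk, left[i][1], r[1]))
                i += 1
            j = j_end
    return out


table_a = [(3, "b3"), (1, "b1"), (2, "b2")]
table_b = [(2, "p2"), (3, "p3"), (4, "p4")]

# Without indexes: both sides must be sorted at query time
# (in a distributed engine, a shuffle precedes this step).
without_index = merge_join(sorted(table_a), sorted(table_b))

# With covering indexes: the sorted copies were built once, ahead of time,
# so the query itself only performs the merge.
idx_a = sorted(table_a)
idx_b = sorted(table_b)
with_index = merge_join(idx_a, idx_b)
```

Both plans produce the same result; the indexed plan simply moves the sorting cost from every query to a one-time index build.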
Demo: Deep-dive into Hyperspace’s Index-based Query Optimization
Hyperspace Performance
Preliminary Performance Evaluation of Hyperspace Covering Indexes
Compute Configuration:
• VM Instance = Azure E8 V3
• Workers/Executors = 7
• Cores per executor = 8
• Executor memory = 47 GB
• Autoscale disabled
• ADLS Gen2
[Chart: Workload derived from TPC Benchmark™ H (TPC-H) (Scale Factor = 1000, Apache Spark 2.4, Parquet data). Duration (seconds) per query for Baseline vs. Hyperspace, with per-query gain, for queries 1-15 and 17-22. Reported gains range from 1.1x to 8.9x.]
No regressions, up to 9x gains
[Chart: Workload derived from TPC Benchmark™ DS (TPC-DS), top 20 queries (Scale Factor = 1000, Apache Spark 2.4, Parquet data). Duration (seconds) per query for Baseline vs. Hyperspace, with per-query gain, for queries 4, 6, 11, 17, 25, 29, 37, 50, 54, 64, 78, 80, 82, 93, 14a, 14b, 23a, 23b, 24a, 24b. Reported gains range from 1.6x to 10.9x.]
No regressions, up to 11x gains
Hyperspace acceleration: 2x on TPC-H, 1.8x on TPC-DS
Workloads derived from TPC Benchmark™ H/DS (Scale Factor = 1000, Apache Spark 2.4, Parquet data)
Up to 11x query performance improvement
Open Sourcing Hyperspace v0.1
New extensible indexing subsystem for Apache Spark
Simply add on, no core changes needed
Same technology that powers the indexing engine inside Azure Synapse Analytics
Works out-of-box with open source Apache Spark
Scala, Python, and .NET support
Accelerated performance on key workloads
https://github.com/microsoft/hyperspace OR https://aka.ms/hyperspace
Thanks to everyone who is making this possible…
Let us build Hyperspace together!
• Meta-data & Lifecycle: Multi-engine interop, concurrency, support for views & stats
• Indexing enhancements: Incremental indexing, index optimization, support for Delta Lake
• Optimizer enhancements: More robust index & view selection, explainability
• Documentation & Tutorials: Best practices, gotchas, more experiments
• More index types: Critique existing design, new designs… more on this in next slide
• Index Recommendation: Single query & multi-query workload-based recommendation
What type of hyperspaces can we build together?
In Hyperspace, “index” is used broadly to refer to a derived dataset, i.e., some auxiliary information about the underlying data that will aid in query acceleration.
COVERING INDEX
Creates a “copy” of the original data in a different sort order. During
optimization, reads from index instead of base table. Useful for eliminating
shuffles and filtering predicates.
CHUNK-ELIMINATION INDEX
Creates a “pointer” from a search key back to the original data. During
optimization, performs a first lookup to obtain the pointer. Useful for
finding-needle-in-the-haystack queries.
MATERIALIZED VIEWS
Executes a (potentially complex) query, stores the results. During
optimization, entire subtrees can be rewritten. Useful when the same result
is computed several times.
STATISTICS
Collects statistics about the underlying dataset. During optimization, can
power a cost-based optimizer. Useful for join re-ordering, index/view
selection etc.
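As a concrete illustration of the chunk-elimination idea (a pointer from a search key back to the original data), here is a minimal Python sketch. The file names, rows, and helper functions are hypothetical, and the in-memory dictionary stands in for whatever structure a real index would persist.

```python
from collections import defaultdict

# Hypothetical data files, each holding (key, value) rows.
files = {
    "part-0": [("k1", 10), ("k2", 20)],
    "part-1": [("k3", 30)],
    "part-2": [("k2", 40), ("k4", 50)],
}


def build_pointer_index(data_files):
    """Map each search key to the set of files that contain it."""
    index = defaultdict(set)
    for name, rows in data_files.items():
        for key, _value in rows:
            index[key].add(name)
    return index


def lookup(data_files, index, key):
    """First consult the index, then read only the files it points to."""
    candidates = sorted(index.get(key, ()))
    hits = [v for name in candidates for k, v in data_files[name] if k == key]
    return candidates, hits


pointer_index = build_pointer_index(files)
touched, values = lookup(files, pointer_index, "k2")
```

For a needle-in-the-haystack query, the first lookup narrows the read to the files that can contain the key, so `part-1` is never opened when searching for `k2`.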
Open Sourcing Hyperspace v0.1
Conclusion
New extensible indexing subsystem for
Apache Spark
Simply add on—no core changes needed
Same technology that powers the indexing
engine inside Azure Synapse Analytics
Works out-of-box with open source Apache
Spark
Scala, Python, and .NET support
Accelerated performance on key workloads
Hyperspace acceleration: 2x on TPC-H, 1.8x on TPC-DS (Scale Factor = 1000, Apache Spark 2.4, Parquet data)
Up to 11x query performance improvement
https://github.com/microsoft/hyperspace
Open Sourced today
It is not perfect… but that’s where we need your
guidance!
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.

More Related Content

What's hot

What's hot (20)

Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 

Similar to Hyperspace: An Indexing Subsystem for Apache Spark

OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
Databricks
 
1 extreme performance - part i
1   extreme performance - part i1   extreme performance - part i
1 extreme performance - part i
sqlserver.co.il
 

Similar to Hyperspace: An Indexing Subsystem for Apache Spark (20)

Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Deep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDBDeep dive into the native multi model database ArangoDB
Deep dive into the native multi model database ArangoDB
 
SQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersSQL Server 2008 Development for Programmers
SQL Server 2008 Development for Programmers
 
1 extreme performance - part i
1   extreme performance - part i1   extreme performance - part i
1 extreme performance - part i
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Spark1
Spark1Spark1
Spark1
 
Azure Databricks & Spark @ Techorama 2018
Azure Databricks & Spark @ Techorama 2018Azure Databricks & Spark @ Techorama 2018
Azure Databricks & Spark @ Techorama 2018
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 

Recently uploaded (20)

Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...

Hyperspace: An Indexing Subsystem for Apache Spark

  • 2. Hyperspace: An Indexing Subsystem for Apache Spark Rahul Potharaju & Terry Kim Microsoft
  • 4. Rahul Potharaju Principal Software Engineering Manager @Microsoft Part of the Spark team at Microsoft Azure Synapse Analytics OSS: Hyperspace, .NET for Apache Spark Publish in academic conferences e.g., VLDB Terry Kim Principal Software Engineer @Microsoft Part of the Spark team at Microsoft Azure Synapse Analytics OSS: Hyperspace, Apache Spark, .NET for Apache Spark
  • 5. We work on everything Spark: offer Spark-as-a-Service to Microsoft customers; contribute back to Apache Spark. We open source our work!
  • 6. Agenda Rahul Potharaju Background, Vision, Concepts, Call-for-Action, Conclusion Terry Kim Demo, Performance Deep-dive
  • 7. What is an index!?
  • 8. In databases, an ‘index’ is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Example, the index from the back of a textbook. N: Namespace 493, 533, 544; Nested-loop join 718-722; Normalization 67, 85-92; Null value 33-35, 168, 252 (see also Not-null constraint). O: Optimization (see Plan selection, Query optimization); ORDER BY 255-256, 461; Ordering 461-463, 541-543 (see also Join ordering, Sorting). R: Random walker 1147, 1154; Range query 639-640, 662-664; Read committed 304-305; READ 849; Read lock (see Shared lock); Read uncommitted 304; Relational calculus 241; Relational database system 3. S: Semijoin 243.
  • 10. Goals of Hyperspace Indexing. Agnostic to Data Format: should index data in the lake in any format, including text (e.g., CSV, JSON, Parquet, ORC, Avro, etc.) and binary data (e.g., video, audio, images, etc.). Low-cost Index Metadata Management: should store all metadata on the data lake and should not assume any other service to operate correctly. Multi-engine Interoperability: should make third-party engine integration (e.g., non-Spark systems) feasible, intuitive and easy (build the index through Spark and leverage it through Synapse SQL). Extensible Indexing Infrastructure: should offer mechanisms for easy pluggability of newer auxiliary data structures (related to indexing). Security, Privacy & Compliance: should meet the necessary security, privacy, and compliance standards, as auxiliary structures copy the original dataset either partly or in full.
  • 11. Vision of Hyperspace Indexing. User-facing Index Management APIs: allow interaction with the indexing ecosystem. Query Infrastructure: Optimizer Extensions (making the optimizer cost- and index-aware, algorithms for index selection); Index Recommendation (allows index suggestions for a query or workload); What-If & Why-not (allows index cost-benefit analysis and explainability). Indexing Infrastructure: Index Creation & Maintenance API (primitives for index lifecycle management, e.g., creating, refreshing, deleting, enforcing retention, purge, etc.); Log Management API (change log for enabling engine interoperability); Index Specifications (layouts for enabling engine interoperability); Concurrency Model (primitives for optimistic concurrency). Data Lake: Datasets (structured, e.g., Parquet, and unstructured, e.g., CSV, TSV); Index (non-clustered: columnar covering index, chunk-elimination, statistics, views).
  • 12. Hyperspace’s Usage API in Spark. Usage (index maintenance): createIndex(df: DataFrame, indexCfg: IndexConfig): Unit; deleteIndex(indexName: String): Unit; restoreIndex(indexName: String): Unit; vacuumIndex(indexName: String): Unit; rebuildIndex(indexName: String): Unit; cancel(indexName: String): Unit. Smarts (debugging and index recommendation): explain(df: DataFrame): Unit; whatIf(workload: Array[DataFrame], indexCfg: IndexConfig): Cost; recommend(workload: Array[DataFrame], options: RecOptions): Recommendation. Customization (configuration for storage and query optimizer): hyperspace.system.path; hyperspace.index.creation.[path | namespace]; hyperspace.index.search.[path | namespace]; hyperspace.index.search.disablePublicIndexes. Language choices: Scala, Python, .NET.
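The maintenance calls on this slide suggest an index lifecycle. The slide gives only signatures, so the sketch below is one plausible reading of the names, not Hyperspace code: it assumes deleteIndex is a soft delete, restoreIndex undoes it, and vacuumIndex permanently removes a soft-deleted index. `ToyIndexCatalog` and its methods are hypothetical.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DELETED = "deleted"  # assumed soft delete: hidden from queries, data kept


class ToyIndexCatalog:
    """Hypothetical in-memory stand-in for an index manager (not Hyperspace)."""

    def __init__(self):
        self._indexes = {}  # index name -> State

    def create_index(self, name):
        if name in self._indexes:
            raise ValueError(f"index {name!r} already exists")
        self._indexes[name] = State.ACTIVE

    def delete_index(self, name):
        # Assumed semantics: only an active index can be soft-deleted.
        self._require(name, State.ACTIVE)
        self._indexes[name] = State.DELETED

    def restore_index(self, name):
        # Assumed semantics: restore brings a soft-deleted index back.
        self._require(name, State.DELETED)
        self._indexes[name] = State.ACTIVE

    def vacuum_index(self, name):
        # Assumed semantics: vacuum permanently drops a soft-deleted index.
        self._require(name, State.DELETED)
        del self._indexes[name]

    def usable(self):
        """Names of indexes a query optimizer would be allowed to pick."""
        return sorted(n for n, s in self._indexes.items() if s is State.ACTIVE)

    def _require(self, name, expected):
        if self._indexes.get(name) is not expected:
            raise ValueError(f"index {name!r} is not {expected.value}")


catalog = ToyIndexCatalog()
catalog.create_index("idx_a")
catalog.delete_index("idx_a")   # no longer usable, but still restorable
catalog.restore_index("idx_a")  # usable again
```

Separating the soft delete from the vacuum step would give users a window in which a mistaken delete can be undone before any index data is removed.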
  • 13. … and btw, the indexes live on the data lake! Filesystem Root /indexes/<scope = public | user | namespace> <index name> _hyperspace_log create (active) refresh active … <index-directory-1> <index-directory-2> <index-directory-3> /path/to/data/1 data files
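One way an operation log stored next to the index could support optimistic concurrency is sketched below. The slides do not describe the on-disk format of `_hyperspace_log`, so the numbered-file layout, the JSON entries, and the helper names `latest_id` and `try_commit` are all assumptions made for illustration. The sketch also assumes the file system offers an atomic create-if-absent operation, which a local disk does and which a data lake store may or may not.

```python
import json
import os
import tempfile


def latest_id(log_dir):
    """Highest log entry id written so far, or -1 for an empty log."""
    ids = [int(name) for name in os.listdir(log_dir) if name.isdigit()]
    return max(ids, default=-1)


def try_commit(log_dir, base_id, entry):
    """Try to write `entry` as log record base_id + 1.

    Returns False if another writer already created that record, which
    means this writer worked from a stale view and must re-read and retry.
    """
    path = os.path.join(log_dir, str(base_id + 1))
    try:
        # Mode "x" fails if the file exists; that failure is the conflict check.
        with open(path, "x") as f:
            json.dump(entry, f)
        return True
    except FileExistsError:
        return False


log_dir = tempfile.mkdtemp()
base = latest_id(log_dir)  # two writers both observe the same base id
first = try_commit(log_dir, base, {"op": "create", "state": "active"})
second = try_commit(log_dir, base, {"op": "refresh", "state": "active"})
# The losing writer re-reads the log and retries on top of the new state.
retry = try_commit(log_dir, latest_id(log_dir), {"op": "refresh", "state": "active"})
```

No lock service is involved: the log directory itself arbitrates between writers, which fits the goal of keeping all metadata on the lake.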
  • 14. … and index-on-the-lake provides several benefits! Index scan scales Open format index Serverless access protocol
  • 15. Azure Synapse Analytics offers the best Hyperspace indexing experience yet! • No additional JAR includes • Fastest access to latest features • Support for Scala | Python | .NET • Seamless integration with the UI • Meta-store integration • Notebooks for faster iterations
  • 16. Demo: Hello Hyperspace! Notebook: https://aka.ms/hellohyperspace
  • 17. Our first hyperspace: the covering index. Creates a “copy” of the original data in a different sort order. During optimization, reads from the index instead of the base table. Useful for eliminating shuffles and filtering predicates. Without an index, on a table with columns (a, b, c): SELECT b WHERE a = ‘Red’ is a full scan (linear time). With a covering index ON a INCLUDE b, holding columns (a, b): SELECT b WHERE a = ‘Red’ is a binary search (log time).
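The contrast between the full scan and the binary search can be sketched in a few lines. This is a toy in-memory model, not how Hyperspace stores or reads an index; the sample rows and the function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Base table with columns (a, b, c), in no particular order.
table = [
    ("Blue", 1, "x"),
    ("Red", 2, "y"),
    ("Green", 3, "z"),
    ("Red", 4, "w"),
]


def full_scan(rows, value):
    """SELECT b WHERE a = value, touching every row (linear time)."""
    return [b for a, b, _c in rows if a == value]


# Covering index ON a INCLUDE b: a copy of just (a, b), sorted by a.
index = sorted((a, b) for a, b, _c in table)
keys = [a for a, _b in index]


def index_lookup(value):
    """Same query answered from the index with two binary searches."""
    lo = bisect_left(keys, value)
    hi = bisect_right(keys, value)
    return [b for _a, b in index[lo:hi]]


scan_result = full_scan(table, "Red")
index_result = index_lookup("Red")
```

The index "covers" the query because it holds both the filter column `a` and the selected column `b`, so the base table, including column `c`, is never read.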
  • 18. Our first hyperspace: the covering index. Query: SELECT b, c FROM Table A, B JOIN ON A.a = B.a, where Table A has columns (a, b, c) and Table B has columns (a, p, q). Without indexes: Step 1: shuffle (data is not sorted); Step 2: sort both sides; Step 3: merge. With covering indexes: Step 1: optimizer picks the indexes Idx A and Idx B (pre-shuffled, pre-sorted); Step 2: merge. Shuffle eliminated. Since shuffle is the most expensive step, this query might run faster at scale.
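When both inputs already arrive sorted on the join key, the join reduces to the merge step alone. The sketch below is a plain single-machine merge join over two sorted lists, standing in for one pre-shuffled, pre-sorted partition of each index; it is not Spark's implementation, and the sample data is invented.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, payload) tuples already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for _rk, r_payload in right[j:j_end]:
                    out.append((lk, left[i][1], r_payload))
                i += 1
            j = j_end
    return out


# Stand-ins for Idx A and Idx B: both already sorted on the join key.
idx_a = [(1, "b1"), (2, "b2"), (2, "b3"), (4, "b4")]
idx_b = [(2, "p1"), (3, "p2"), (4, "p3"), (4, "p4")]
result = merge_join(idx_a, idx_b)
```

Each input is read once, front to back, and no sort or data movement happens inside the join. That is the work the "Without indexes" path has to do first in steps 1 and 2.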
  • 19. Demo: Deep-dive into Hyperspace’s Index-based Query Optimization
  • 21. Preliminary Performance Evaluation of Hyperspace Covering Indexes. Compute Configuration: VM Instance = Azure E8 V3; Workers/Executors = 7; Cores per executor = 8; Executor memory = 47 GB; Autoscale disabled; ADLS Gen v2. [Chart 1: workload derived from TPC Benchmark™ H (TPC-H) (Scale Factor = 1000, Apache Spark 2.4, Parquet data); queries 1-15 and 17-22; series: Baseline, Hyperspace, Gain; axis: Duration (seconds); gain labels in chart order: 1.2, 2.4, 1.4, 2.3, 1.1, 3.6, 1.3, 6.8, 5.4, 1.8, 4.5, 1.9, 1.8, 2.0, 3.6, 8.9, 1.5, 1.9, 2.1, 1.1, 3.8. No regressions, up to 9x gains.] [Chart 2: workload derived from TPC Benchmark™ DS (TPC-DS), Top 20 (Scale Factor = 1000, Apache Spark 2.4, Parquet data); queries 4, 6, 11, 17, 25, 29, 37, 50, 54, 64, 78, 80, 82, 93, 14a, 14b, 23a, 23b, 24a, 24b; series: Baseline, Hyperspace, Gain; axis: Duration (seconds); gain labels in chart order: 2.5, 3.8, 2.5, 4.7, 6.1, 6.7, 3.3, 4.9, 2.9, 4.9, 2.2, 6.9, 5.6, 10.9, 1.8, 2.0, 2.3, 2.2, 3.9, 1.6. No regressions, up to 11x gains.]
  • 22. Preliminary Performance Evaluation of Hyperspace Covering Indexes. Hyperspace acceleration: 2x (TPC-H), 1.8x (TPC-DS). Up to 11x query performance improvement. Workloads derived from TPC Benchmark™ H/DS (Scale Factor = 1000, Apache Spark 2.4, Parquet data). Compute Configuration: VM Instance = Azure E8 V3; Workers/Executors = 7; Cores per executor = 8; Executor memory = 47 GB; Autoscale disabled; ADLS Gen v2.
  • 23. Open Sourcing Hyperspace v0.1. New extensible indexing subsystem for Apache Spark. Simply add on: no core changes needed. Same technology that powers the indexing engine inside Azure Synapse Analytics. Works out-of-box with open source Apache Spark. Scala, Python, and .NET support. Accelerated performance on key workloads. https://github.com/microsoft/hyperspace OR https://aka.ms/hyperspace
  • 24. Thanks to everyone who is making this possible…
  • 25. Let us build Hyperspace together! Meta-data & Lifecycle: multi-engine interop, concurrency, support for views & stats. Indexing enhancements: incremental indexing, index optimization, support for Delta Lake. Optimizer enhancements: more robust index & view selection, explainability. Documentation & Tutorials: best practices, gotchas, more experiments. More index types: critique existing design, new designs… more on this in next slide. Index Recommendation: single query & multi-query workload-based recommendation.
  • 26. What type of hyperspaces can we build together? In Hyperspace, “index” is used broadly to refer to a derived dataset, i.e., some auxiliary information about the underlying data that will aid in query acceleration. COVERING INDEX: Creates a “copy” of the original data in a different sort order. During optimization, reads from index instead of base table. Useful for eliminating shuffles and filtering predicates. CHUNK-ELIMINATION INDEX: Creates a “pointer” from a search key back to the original data. During optimization, performs a first lookup to obtain the pointer. Useful for finding-needle-in-the-haystack queries. MATERIALIZED VIEWS: Executes a (potentially complex) query, stores the results. During optimization, entire subtrees can be rewritten. Useful when the same result is computed several times. STATISTICS: Collects statistics about the underlying dataset. During optimization, can power a cost-based optimizer. Useful for join re-ordering, index/view selection, etc.
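The chunk-elimination idea (a pointer from a search key back to the data that holds it) can be sketched as a map from key to the set of data files containing that key. The slide does not say how such an index would be laid out, so the structure, the file names, and the helper functions below are hypothetical.

```python
from collections import defaultdict

# Hypothetical data files ("chunks"), each holding a few (key, value) rows.
chunks = {
    "part-0": [("alice", 10), ("bob", 20)],
    "part-1": [("carol", 30), ("dave", 40)],
    "part-2": [("bob", 50), ("erin", 60)],
}


def build_pointer_index(chunks):
    """Map each search key to the set of chunk names that contain it."""
    index = defaultdict(set)
    for name, rows in chunks.items():
        for key, _value in rows:
            index[key].add(name)
    return index


def lookup(chunks, index, key):
    """First consult the index, then read only the chunks it points to."""
    touched = sorted(index.get(key, ()))
    rows = [v for name in touched for k, v in chunks[name] if k == key]
    return touched, rows


index = build_pointer_index(chunks)
touched, rows = lookup(chunks, index, "bob")
```

For a needle-in-the-haystack query the first lookup is cheap and rules out most chunks (here `part-1` is never opened), whereas a covering index would have to duplicate the column data itself.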
  • 27. Conclusion: Open Sourcing Hyperspace v0.1. New extensible indexing subsystem for Apache Spark. Simply add on: no core changes needed. Same technology that powers the indexing engine inside Azure Synapse Analytics. Works out-of-box with open source Apache Spark. Scala, Python, and .NET support. Accelerated performance on key workloads. Hyperspace acceleration: 2x (TPC-H), 1.8x (TPC-DS) (Scale Factor = 1000, Apache Spark 2.4, Parquet data). Up to 10x query performance improvement. https://github.com/microsoft/hyperspace Open sourced today. It is not perfect… but that’s where we need your guidance!
  • 28. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.