This document summarizes a presentation about accelerating Apache Spark workloads using NVIDIA's RAPIDS Accelerator. It notes that global data generation is expected to grow to 221 zettabytes by 2026. RAPIDS can provide significant speedups and cost savings for Spark workloads by leveraging GPUs. Benchmark results show an NVIDIA decision support benchmark running 5.7x faster at 4.5x lower cost on a GPU cluster compared to a CPU cluster. The document outlines RAPIDS integration with Spark and provides information on qualification, configuration, and future developments.
SFBigAnalytics_SparkRapid_20220622.pdf
1. SF Big Analytics Meetup
Accelerate Spark With RAPIDS For Cost Savings
Sameer Raheja, Sr. Dir, Engineering
June 22, 2023
2. 221 Zettabytes of Data Generated by 2026
[Chart: global data generation by year, 2021-2026, growing to 221 zettabytes in 2026; 2021-2026 CAGR of 21.2%]
Source: IDC, Global DataSphere Forecast, 2022-2026: Enterprise Organizations Driving Most of the Growth, May 2022
3. How to Deal With Data Growth?
• Increase Cost: add compute capacity to meet growing data demands
• Increase Time: accept higher latency as data volumes grow
• Reduce Fidelity: down-sample data to meet cost and SLA targets
4. Scaling ETL Processing With Apache Spark on GPUs
RAPIDS Accelerator for Apache Spark
[Timeline: growth in the requirement for data processing, 2000-2030 — Hadoop, then Spark 2.0 on CPUs, then GPU-accelerated Spark 3.0]
6. NVIDIA Decision Support Benchmark
NVIDIA Decision Support (NDS) is our adaptation of the TPC-DS benchmark often used by Spark customers and providers. NDS consists of the same 100+ SQL queries as the industry-standard benchmark, but with parts of the execution scripts modified.
The NDS benchmark is derived from the TPC-DS benchmark and as such is not comparable to published TPC-DS results, as the NDS results do not comply with the TPC-DS Specification.
https://github.com/nvidia/spark-rapids-benchmarks
9. NVIDIA Decision Support Benchmark 3TB
RAPIDS Spark release 23.02
5.7x speedup 4.5x cost savings 5x more efficient
NDS 3TB results, CPU vs GPU cluster:
Time (min):    CPU 176     GPU 31
Power (watts): CPU 12,425  GPU 2,479
Cost ($):      CPU $32.62  GPU $7.20
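The headline ratios follow directly from the measured values above; a quick check in Python:

```python
# Ratios recomputed from the CPU vs GPU values on the slide (NDS 3TB).
cpu_time_min, gpu_time_min = 176, 31
cpu_cost_usd, gpu_cost_usd = 32.62, 7.20
cpu_power_w, gpu_power_w = 12425, 2479

speedup = cpu_time_min / gpu_time_min        # ~5.7x faster
cost_savings = cpu_cost_usd / gpu_cost_usd   # ~4.5x cheaper
efficiency = cpu_power_w / gpu_power_w       # ~5.0x less power drawn
print(f"{speedup:.1f}x, {cost_savings:.1f}x, {efficiency:.1f}x")  # 5.7x, 4.5x, 5.0x
```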
10. Retail use case on Google Cloud Dataproc
Automate the ranking of thousands of online shelves
40 nodes: 2x speedup, 40% cost benefit
20 nodes: 1.6x speedup, 60% cost benefit
10 nodes: 1.2x speedup, 70% cost benefit
11. Cost Savings Across Clouds
NDS benchmark scale factor 3000
Platform:     GCP Dataproc 2.1 | AWS EMR 6.9 | AWS Databricks Photon 10.4 | Azure Databricks Photon 10.4
Cost Savings: 78% | 42% | 39% | 34%
CPU:          n1-highmem-32 (4) | m6gd.8xlarge (8) | m6gd.8xlarge (8) | Standard_E16ds_v4 (8)
GPU:          n1-highmem-32 (4) + 2 x T4 / node | g4dn.12xlarge (2), 4 x T4 / node | g5.8xlarge (4), 1 x A10 / node | Standard_NC8as_T4_v3 (8), 1 x T4 / node
GCP pricing: https://cloud.google.com/compute/vm-instance-pricing, us-central1 , Feb 2023
AWS pricing: https://aws.amazon.com/emr/pricing/ , us-west-2, Feb 2023
Databricks AWS pricing: https://www.databricks.com/product/aws-pricing , Feb 2023
Databricks Azure pricing: https://azure.microsoft.com/en-us/pricing/details/databricks/#instance-type-support , Feb 2023
12. RAPIDS Accelerator Distribution Availability
Available now on: Apache Spark 3.x community release, Databricks Machine Learning Runtime, Amazon EMR, Cloudera CDP, Google Cloud Dataproc, and Microsoft Azure Synapse Analytics.
17. Apache Spark 3.x
RAPIDS Accelerator for Apache Spark
[Diagram: Spark 3.x stack — Python, SQL, R, Java, and Scala APIs; Spark SQL, Spark ML, Spark Streaming, and GraphX on Spark Core; driver, cluster manager, and worker nodes running Spark built-in processing alongside the RAPIDS Spark plugin]
18. RAPIDS Accelerator for Apache Spark
Spark Plugin for GPU Acceleration
[Diagram: Spark DataFrame / SQL API application on Spark Core; the RAPIDS Spark plugin layers Java bindings over the RAPIDS C++ libraries and CUDA]

    if gpu_enabled(operation && datatype):
        RAPIDS Spark
    else:
        Spark CPU
19. RAPIDS Accelerator for Apache Spark
Spark Plugin for GPU Acceleration
[Diagram: distributed scale-out Spark applications on Apache Spark Core (Spark SQL API, DataFrame API, Spark Shuffle); the RAPIDS Accelerator for Apache Spark provides Java bindings over the RAPIDS C++ and UCX libraries on CUDA]

    if gpu_enabled(operation && datatype):
        RAPIDS Spark
    else:
        Spark CPU

Shuffle: custom implementation optimized for RDMA and GPU-to-GPU direct communication.
20. No Query Changes
• Add jars to classpath and set
spark.plugins config
• Same SQL and DataFrame code
• Compatible with PySpark, SparkR,
Koalas, and other DataFrame-based APIs
• Seamless fallback to CPU for
unsupported operations
spark.sql("""
  SELECT
    o_orderpriority,
    count(*) AS order_count
  FROM
    orders
  WHERE
    o_orderdate >= DATE '1993-07-01'
    AND o_orderdate < DATE '1993-07-01' + interval '3' month
    AND EXISTS (
      SELECT *
      FROM lineitem
      WHERE
        l_orderkey = o_orderkey
        AND l_commitdate < l_receiptdate
    )
  GROUP BY o_orderpriority
  ORDER BY o_orderpriority
""").show()
Example: SQL query text processed directly by Spark using spark.sql() API.
No changes are required to the query or any of the Spark code.
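A minimal sketch of what "add jars to classpath and set spark.plugins" looks like in practice. The jar path and version here are placeholders; use the plugin release that matches your Spark and CUDA versions, and tune the resource amounts for your cluster.

```shell
# Enable the RAPIDS Accelerator on an existing job via spark-submit.
# No changes to the job code itself are needed.
spark-submit \
  --jars rapids-4-spark_2.12-23.02.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  my_existing_job.py
```

Unsupported operations fall back to the CPU automatically, so the same submit command works even for queries the plugin cannot fully accelerate.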
21. Spark SQL & DataFrame Query Execution
[Diagram: DataFrame → Logical Plan → Physical Plan → RDD[InternalRow] (CPU), or → GPU Physical Plan → RDD[ColumnarBatch] (GPU)]

DataFrame form:

    bar.groupBy(col("product_id"), col("ds"))
       .agg((max(col("price")) - min(col("price"))).alias("range"))

Equivalent SQL:

    SELECT product_id, ds,
           max(price) - min(price) AS range
    FROM bar
    GROUP BY product_id, ds
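Both forms compute, per (product_id, ds) group, the price range max(price) - min(price). As a plain-Python illustration with hypothetical sample rows standing in for the `bar` table:

```python
from collections import defaultdict

# Toy rows standing in for the `bar` table (hypothetical sample data).
rows = [
    {"product_id": 1, "ds": "2023-06-01", "price": 10.0},
    {"product_id": 1, "ds": "2023-06-01", "price": 14.0},
    {"product_id": 2, "ds": "2023-06-01", "price": 7.0},
]

# Group prices by (product_id, ds), mirroring groupBy(col("product_id"), col("ds")).
groups = defaultdict(list)
for r in rows:
    groups[(r["product_id"], r["ds"])].append(r["price"])

# max(price) - min(price) per group, i.e. the "range" column.
ranges = {k: max(v) - min(v) for k, v in groups.items()}
print(ranges)  # {(1, '2023-06-01'): 4.0, (2, '2023-06-01'): 0.0}
```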
22. Spark SQL & DataFrame Query Execution
CPU physical plan: Read Parquet File → First Stage Aggregate → Shuffle Exchange (combine shuffle data) → Second Stage Aggregate → Write Parquet File

GPU physical plan: the same stages executed on columnar batches on the GPU, with "Convert to Row Format" transitions inserted only where an operation must run on the CPU.
23. Task Level Parallelism vs. Data Level Parallelism
Data parallelism is cooperative. It uses a more consistent amount of memory and reduces total data in memory for operations like sort, which results in less spilling.
[Diagram: CPU — Tasks 1-4 each processing their own slice of the data; GPU — Tasks 1-2 cooperating over larger shared batches]
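As a loose stdlib illustration of the contrast (not how Spark or the plugin is implemented): task-level parallelism sorts each partition independently and then merges the runs, keeping several buffers resident, while data-level parallelism sorts the whole batch cooperatively in one pass.

```python
import heapq
import random

random.seed(0)
data = [random.randint(0, 999) for _ in range(1000)]

# Task-level parallelism: 4 tasks each sort their own partition,
# then the sorted runs are merged (extra pass, 4 resident buffers).
partitions = [sorted(data[i::4]) for i in range(4)]
task_level = list(heapq.merge(*partitions))

# Data-level parallelism: one cooperative sort over the whole batch.
data_level = sorted(data)

print(task_level == data_level)  # True -- same result, different memory profile
```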
24. Is this a silver bullet?
No. The GPU is not a win for:
• Small amounts of data
• Highly cache-coherent processing
• A slow distributed filesystem
• Slow local disks or network for shuffle
• CPU/GPU transitions on many rows
• User-defined functions (UDFs) not on the GPU
[Chart: bandwidth in MB/s (log scale) — spinning disk 160; SSD 550; 10GigE 1,250; NVMe 3,500; PCIe Gen3 12,288; PCIe Gen4 24,576; DDR4-3200 DIMM 25,600; typical PC RAM 46,080; NVLink 307,200; CPU cache 1,048,576]
25. But it can be amazing
Where the RAPIDS Accelerator excels:
• High-cardinality processing
• Joins
• Aggregations
• Sort
• Window operations, especially large windows
• Complicated processing
• Transcoding data formats
• Encoding Parquet and ORC
• Reading CSV
[Charts: 18x speedup on window — CPU 208 s vs GPU 11 s; 9x speedup on large join — CPU 117 s vs GPU 12 s]
https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31436/