This document discusses accelerating Python user-defined functions (UDFs) in PySpark using Numba and PyGDF. It describes how data movement between the JVM and Python workers is currently a bottleneck for PySpark Python UDFs. With Apache Arrow, data can be transferred in a columnar format without serialization, improving performance. PyGDF enables defining UDFs that operate directly on GPU data frames using Numba for further acceleration. This allows leveraging GPUs to optimize complex UDFs in PySpark. Future work includes optimizing joins in PyGDF and supporting distributed GPU processing.
3. 3
DATA DELUGE TO INSIGHT HUNGRY
Increasing Data Variety
[Word cloud spanning gigabytes through terabytes, petabytes, exabytes, and zettabytes: search marketing, behavioral targeting, dynamic funnels, user-generated content, mobile web, SMS/MMS, sentiment, HD video, speech-to-text, product/service logs, social networks, business data feeds, user click streams, sensors, infotainment systems, wearable devices, cyber security logs, connected vehicles, machine data, IoT data, dynamic pricing, payment records, purchase details, support contacts, segmentation, offer details and history, web logs, A/B testing, streaming video, and natural language processing, grouped under Business Process, Web, Digital, and AI.]
5. 5
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
[Diagram of Hadoop processing reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train.]
6. 6
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
[Diagram comparing two pipelines:
• Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
• Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)]
7. 7
APACHE SPARK
Cluster computing framework
• Spark has almost become synonymous with Hadoop and Big Data
• Integrates with nearly the entire Big Data ecosystem
• The processing layer for big data and a leading ML framework
• Five main components: RDD API, SQL, Streaming, MLlib, and GraphX
8. 8
SPARK IS NOT ENOUGH
Basic workloads are bottlenecked by the CPU
• In a simple benchmark consisting of aggregating data, the CPU is the bottleneck
• This is after the data is parsed and cached into memory, which is another common bottleneck
• The CPU bottleneck is even worse in more complex workloads!

SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type;

Source: Mark Litwintschik’s blog: 1.1 Billion Taxi Rides: EC2 versus EMR
9. 9
SPARK ECOSYSTEM
Lacks Full GPU Integration
• 4 core parts: SQL, Streaming (Spark functions micro-batched), Machine Learning, & Graph
• Spark is currently optimizing its existing code base and adding usability, not GPU support yet
10. 10
SPARK ECOSYSTEM
GPU-Acceleration Possible But Not Ideal
• Using Numba, the Microsoft Azure team released a basic example showing a ~5x speedup using GPUs with Spark
• This example is extremely limited in that it does not pass any real data to the Python process or the GPU
• When passing data from Spark to the GPU there are new issues and performance considerations

Source: https://github.com/Azure/aztk/blob/master/node_scripts/jupyter-samples/GPU%2Bvs%2BCPU%2Busing%2BNumba.ipynb
14. 14
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
[Diagram comparing three pipelines:
• Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
• Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
• GPU/Spark In-Memory Processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU)]
16. 16
GPU-ACCELERATED ARCHITECTURE THEN
Too much data movement and too many different data formats
[Diagram: applications such as H2O.ai, Graphistry, BlazingDB, MapD, Simantex, Anaconda, Gunrock, and nvGRAPH each load or read data on the CPU, then copy and convert it into their own per-application GPU data format, with a Copy & Convert step between every application.]
18. 18
APACHE ARROW COMMON DATA LAYER
From Apache Arrow Home Page - https://arrow.apache.org/
19. 19
GPU-ACCELERATED ARCHITECTURE NOW
Single data format and shared access to data on GPU
[Diagram: data is read or loaded once, and H2O.ai, Graphistry, BlazingDB, MapD, Simantex, Anaconda, Gunrock, and nvGRAPH all share a single Apache Arrow-powered GPU Data Frame resident in GPU memory.]
20. 20
GPU DATA FRAME
Faster Data Access, Less Data Movement
[Diagram extending the pipeline comparison with a fourth row:
• Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
• Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
• GPU/Spark In-Memory Processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU)
• End to End GPU Processing (GOAI): Arrow Read → Query → ETL → ML Train (25-100x improvement, same code, language flexible, primarily on GPU)]
21. 21
GPU OPEN ANALYTICS INITIATIVE
First Project, the GPU Data Frame
No Copy & Converts - Full Interoperability
• The GPU Data Frame is the first project of GOAI: Apache Arrow for the GPU
• libgdf: a C library of helper functions, including:
  • Copying the GDF metadata block to the host and parsing it into a host-side struct
  • Importing/exporting a GDF using the CUDA IPC mechanism
  • CUDA kernels to perform element-wise math operations on GDF columns
  • CUDA sort, join, and reduction operations on GDFs
• pygdf: a Python library for manipulating GDFs
  • Python interface to the libgdf library with additional functionality
  • Creating GDFs from NumPy arrays and Pandas DataFrames
  • JIT compilation of group-by and filter kernels using Numba
• dask_gdf: an extension for Dask to work with distributed GDFs
  • Same operations as pygdf, but working on GDFs chunked onto different GPUs and different servers
  • Will bring the same Kubernetes support that Dask already has

github.com/gpuopenanalytics
[Diagram: H2O.ai, Numba, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH, and Simantex sharing the Apache Arrow-powered GPU Data Frame.]
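As a concrete illustration of the "Creating GDFs from NumPy arrays and Pandas DataFrames" point, here is a minimal sketch. The host-side DataFrame runs as-is; the GPU step is commented out because it needs pygdf and a CUDA device, and the `from_pandas`/`to_pandas` names reflect the pygdf API as described above.

```python
import numpy as np
import pandas as pd

# Build host-side columns first; this is the data that would be copied
# into contiguous GPU columns by pygdf.
pdf = pd.DataFrame({"a": np.arange(5), "b": np.linspace(0.0, 1.0, num=5)})

# GPU step (requires pygdf and a CUDA GPU):
# import pygdf
# gdf = pygdf.DataFrame.from_pandas(pdf)   # copy each column to GPU memory
# round_trip = gdf.to_pandas()             # copy back to the host
```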
23. 23
GPU ACCELERATION ACROSS THE ECOSYSTEM
[Diagram: H2O.ai, Numba, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH, and Simantex all sharing the Apache Arrow-powered GPU Data Frame.]
24. 24
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
[Diagram: the four-pipeline comparison of Hadoop Processing (reading from disk), Spark In-Memory Processing, GPU/Spark In-Memory Processing, and End to End GPU Processing (GOAI): Arrow Read → Query → ETL → ML Train, with a 25-100x improvement, the same code, language flexibility, and processing primarily on the GPU.]
33. 33
PYSPARK UDFS
When Spark built-ins can’t get the job done alone
• User-defined functions (UDFs) allow for creating column-based functions outside the scope of Spark built-in functions
• UDFs can be defined in Scala/Java or Python and be called from PySpark
• Using Python lambdas in map functions is essentially the same as using a Python UDF
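A minimal sketch of what such a Python UDF looks like. The plain function below is what the UDF wraps and runs anywhere; the commented lines show how it would be registered in PySpark (they require a SparkSession, and the column names are illustrative).

```python
# The Python function a UDF wraps: plain scalar-in, scalar-out logic.
def fare_per_mile(fare, distance):
    # Guard against zero-distance trips.
    return fare / distance if distance else 0.0

# Registration and use in PySpark (requires a SparkSession named `spark`):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import DoubleType
# fare_per_mile_udf = udf(fare_per_mile, DoubleType())
# trips.withColumn("fare_per_mile",
#                  fare_per_mile_udf("fare_amount", "trip_distance"))
```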
34. 34
PYSPARK PYTHON UDFS
Python UDFs in PySpark need Python workers and data movement
[Diagram: local code with a Spark Context driving executor JVMs in a local cluster.]
35. 35
PYSPARK PYTHON UDFS
Moving data from the JVM to Python efficiently is hard
[Diagram: the same local cluster, highlighting the data path from each executor JVM out to Python worker processes.]
36. 36
PYSPARK PYTHON UDFS
How is the data movement implemented?
• Rows of data are pickled and sent from the executor JVM process to Python worker processes
• This bottlenecks the data pipeline, but how badly?
• Many people avoid this problem by defining their UDFs in Scala/Java and calling them from PySpark
[Diagram: an executor JVM exchanging pickled rows with its Python workers.]
37. 37
PYSPARK PYTHON UDFS
Performance analysis of a basic UDF
• Almost all of the time is spent serializing and deserializing data as opposed to performing the actual calculations!
• We can’t actually feed the GPU fast enough to take advantage of the performance benefits!

lambda x: x + 1

Source: Julien LeDem, Li Jin: Improving Python and Spark Performance and Interoperability with Apache Arrow
38. 38
PYSPARK 2.3
First release with Apache Arrow compatibility!

spark.sql.execution.arrow.enabled → true
40. 40
PYSPARK 2.3 PANDAS UDFS
Vectorized user-defined functions using Pandas

Scalar Pandas UDFs: @pandas_udf('double', PandasUDFType.SCALAR)
• Pandas.Series in, Pandas.Series out
• Input and output Series must be the same length
• Output Series must be of the type defined in the decorator

Grouped Map Pandas UDFs: @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
• Pandas.DataFrame in, Pandas.DataFrame out
• Output DataFrame can be any length
• Output DataFrame schema defined via a Spark SQL DataFrame schema
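A sketch of both flavors. The function bodies are plain pandas and run as-is; the decorators (commented out) are what PySpark 2.3+ adds on top and require a SparkSession, and the column names in the grouped-map example are illustrative.

```python
import pandas as pd

# Scalar Pandas UDF: Series in, Series out, same length.
# @pandas_udf('double', PandasUDFType.SCALAR)
def add_one(s: pd.Series) -> pd.Series:
    return s + 1

# Grouped Map Pandas UDF: DataFrame in, DataFrame out, any length.
# @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def demean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Subtract each group's mean from column v.
    return pdf.assign(v=pdf.v - pdf.v.mean())
```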
41. 41
PYSPARK 2.3 PANDAS UDFS
PySpark data movement performance issues resolved
• Data is converted from rows to Apache Arrow columnar record batches within the executor JVM processes
• Data does not have to be serialized or deserialized!
[Diagram: an executor JVM exchanging Apache Arrow columnar record batches with its Python workers.]
42. 42
PYSPARK 2.3 PANDAS UDFS
No more serialization and deserialization overhead!
• With the data movement performance issues resolved, the bottleneck for many UDFs gets pushed back to the compute
• We can utilize GPUs to help in this respect!

lambda x: x + 1

Source: Julien LeDem, Li Jin: Improving Python and Spark Performance and Interoperability with Apache Arrow
44. 44
PANDAS UDFS WITH GPUS
Pandas ↔ PyGDF makes this easy!
• Scalar Pandas UDFs (@pandas_udf('double', PandasUDFType.SCALAR)): Pandas.Series ↔ PyGDF.Series
• Grouped Map Pandas UDFs (@pandas_udf(schema, PandasUDFType.GROUPED_MAP)): Pandas.DataFrame ↔ PyGDF.DataFrame
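A sketch of the pattern inside a Grouped Map Pandas UDF body: convert the incoming Pandas batch to a PyGDF DataFrame, compute on the GPU, and convert back so Arrow can return the result. The pygdf calls are commented out because they need a CUDA GPU; the CPU fallback below has the same semantics so the sketch is runnable, and the column names are illustrative.

```python
import pandas as pd

def gpu_udf_body(pdf: pd.DataFrame) -> pd.DataFrame:
    # GPU path (requires pygdf and a CUDA device):
    # gdf = pygdf.DataFrame.from_pandas(pdf)   # host -> GPU
    # gdf['v2'] = gdf['v'] * 2.0               # columnar op on the GPU
    # return gdf.to_pandas()                   # GPU -> host for the return
    # CPU fallback with identical semantics:
    return pdf.assign(v2=pdf['v'] * 2.0)
```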
45. 45
PANDAS UDFS WITH GPUS
What about more advanced operations?
• Many UDFs are created because the function can’t be easily built from Spark primitives
• Such functions probably can’t be built from PyGDF primitives either
• Writing low-level code and tying it into your UDF is a non-starter
46. 46
PANDAS UDFS WITH GPUS
Numba to the rescue!
• Luckily, PyGDF has convenience functions for Numba to JIT compile CUDA kernels for optimized execution on the GPU
• DataFrame.apply_rows()
• Series.applymap()
• UDFs within UDFs!
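A sketch of the apply_rows pattern: you write a row-wise kernel in plain Python and PyGDF has Numba JIT-compile it into a CUDA kernel. Because the kernel is plain Python, it can also be exercised on host arrays; the `incols`/`outcols`/`kwargs` call shape shown in the comment follows the pygdf API and is commented out since it needs a GPU.

```python
# Row-wise kernel: apply_rows hands it whole column chunks, so it
# iterates element-wise and writes into a preallocated output column.
def mul_kernel(x, y, out):
    for i, (a, b) in enumerate(zip(x, y)):
        out[i] = a * b

# On a GDF (requires pygdf and a CUDA GPU):
# import numpy as np
# result = gdf.apply_rows(mul_kernel,
#                         incols=['x', 'y'],
#                         outcols={'out': np.float64},
#                         kwargs={})
```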
48. 48
LESSONS LEARNED
GPU-accelerated UDFs are hard to do right
• Data needs to be large enough to utilize the GPU effectively, but not so large that it exhausts GPU memory (1e6 – 9e9)
• The work done on the GPU needs to be substantial enough to prevent data transfer from dominating execution time
  • E.g., group by a timestamp and run a Grouped Map Pandas UDF of GPU-accelerated PageRank per group
• PyGDF depends on Arrow 0.7.1 for now while PySpark uses Arrow 0.8+; work is in progress to update the dependency
  • See https://github.com/kkraus14/libgdf/tree/temp_remove_ipc_arrow for a temporary workaround
50. 50
PYGDF AND LIBGDF
• Optimized join performance
• GDF graph analytics library
• Support for multiple interconnected GPUs in libgdf and PyGDF (same PCIe root or NVLink)
• General performance improvements across the board

TPCH Query 21 – End to End Results Using 32-bit Keys*
TIME (MS)               SF1     SF10    SF100
CPU (single-threaded)   1329    31731   465064
V100 (PCIe3)            22      164     1521
V100 (3xNVLINK2)        12      45      466
(At SF100: V100 PCIe3 is ~300x faster than the CPU; 3xNVLINK2 adds a further 3.2x.)

TPCH Query 4 – End to End Results Using 32-bit Keys*
TIME (MS)               SF1     SF10    SF100
CPU (single-threaded)   150     2041    24960
V100 (PCIe3)            13      105     946
V100 (3xNVLINK2)        7       23      308
(At SF100: V100 PCIe3 is ~26x faster than the CPU; 3xNVLINK2 adds a further 3.1x.)
51. 51
NUMBA AND CUPY
Standard Python GPU N-Dimensional Array
• Numba and CuPy are unifying their GPU backends to share an n-dimensional array implementation
• Hoping to get additional Python libraries like PyCUDA, PyTorch, etc. to unify as well in the future
52. 52
DASK.GDF AND DASK.CUPY
Scale out in addition to scaling up
• Use Dask as the scale-out method for distributed GPU data structures
• Extend Dask’s Kubernetes integration as needed to support the full extent of GPU integration
• Dask.GDF is in the very early stages of development: https://github.com/gpuopenanalytics/dask_gdf
• Dask.CuPy has not started yet, but if interested, we’re hiring!
53. 53
SPARK 2.3+ WISHES
More Arrow-based Pandas UDF types

Partition Pandas UDFs: @pandas_udf(schema, PandasUDFType.PARTITION)
• Pandas.DataFrame in, Pandas.DataFrame out
• Output DataFrame can be any length
• Output DataFrame schema defined via a Spark SQL DataFrame schema
54. 54
SPARK 2.3+ WISHES
Arrow as the primary data format for Spark DataFrames
• Currently Spark can take advantage of columnar file formats and columnar data connections by loading only the necessary columns and pushing down predicates
• Most typical operations benefit from a columnar data structure
• Using Arrow would allow for optimized compute kernels and reduce the JVM dependency in the future
• Eventually, native GPU acceleration
55. 55
JOIN THE REVOLUTION
Everyone Can Help!
Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!
APACHE ARROW: https://arrow.apache.org/ (@ApacheArrow)
GPU Open Analytics Initiative: http://gpuopenanalytics.com/ (@Gpuoai)