GPU-ACCELERATING UDFS IN
PYSPARK WITH NUMBA AND PYGDF
Joshua Patterson @datametrician
Keith Kraus @keithjkraus
2
THE DATA
STRUGGLE IS
REAL…
3
DATA DELUGE TO INSIGHT HUNGRY
INCREASING DATA VARIETY
Search
Marketing
Behavioral
Targeting
Dynamic
Funnels
User
Generated
Content
Mobile Web
SMS/MMS
Sentiment
HD Video
Speech To
Text
Product/
Service Logs
Social
Network
Business
Data Feeds
User Click
Stream
Sensors Infotainment
Systems
Wearable
Devices
Cyber
Security Logs
Connected
Vehicles
Machine
Data
IoT Data
Dynamic
Pricing
Payment
Record
Purchase
Detail
Purchase
Record
Support
Contacts
Segmentation
Offer
Details
Web
Logs
Offer
History
A/B
Testing
BUSINESS
PROCESS
GIGABYTES TERABYTES PETABYTES EXABYTES ZETTABYTES
Streaming
Video
Natural
Language
Processing
WEB
DIGITAL
AI
4
DATA FORMATS
Avro
XML
JSON
GML
ProtoBuf
HDFS
Pickle
CSV
Parquet
Pandas
Plain Text vs Binary
Compressed vs Uncompressed
CSR
COO
CSC
* Not a complete list
Numpy
5
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
6
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train
• 25-100x Improvement
• Less code
• Language flexible
• Primarily In-Memory
7
Cluster computing framework
• Spark has almost become synonymous with Hadoop and Big Data
• Integrates with nearly the entire Big Data ecosystem
• The processing layer for big data and leading ML framework
• Five main components: RDD API, SQL, Streaming, MLlib, and GraphX
APACHE SPARK
8
SPARK IS NOT ENOUGH
Basic workloads are bottlenecked by the CPU
Source: Mark Litwintschik’s blog: 1.1 Billion Taxi Rides: EC2 versus EMR
• In a simple benchmark consisting of aggregating data, the CPU is the bottleneck
• This is after the data is parsed and cached into memory, which is another common bottleneck
• The CPU bottleneck is even worse in more complex workloads!
SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type;
9
SPARK ECOSYSTEM
Lacks Full GPU Integration
• 4 Core Parts: SQL, Streaming (Spark functions micro batched), Machine Learning, & Graph
• Spark is currently optimizing its existing code base, adding more usability, not GPU support yet
10
SPARK ECOSYSTEM
GPU-Acceleration Possible But Not Ideal
• Using Numba, the Microsoft Azure team released a basic example showing a ~5x speedup using GPUs with Spark
• This example is extremely limited in that they’re not passing any real data to the Python process or the GPU
• When wanting to pass data from Spark to the GPU there are new issues and performance considerations
Source: https://github.com/Azure/aztk/blob/master/node_scripts/jupyter-samples/GPU%2Bvs%2BCPU%2Busing%2BNumba.ipynb
11
GPUS FTW!
12
GPUS ARE FAST
1.1 Billion Taxi Ride Benchmark
Time in Milliseconds (lower is better)
SYSTEM           Query 1  Query 2  Query 3  Query 4
MapD DGX-1       21       80       150      372
MapD 4 x P100    30       99       269      696
Redshift 6-node  1560     1250     2250     2970
Spark 11-node    10190    8134     19624    85942
Source: MapD Benchmarks on DGX from internal NVIDIA testing following guidelines of Mark Litwintschik’s blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS @marklit82
13
GPUS ARE FAST
K-Means Benchmark
10 with latest solver
14
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train
• 25-100x Improvement
• Less code
• Language flexible
• Primarily In-Memory
GPU/Spark In-Memory Processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
• 5-10x Improvement
• More code
• Language rigid
• Substantially on GPU
15
GPU ACCELERATED TECHNOLOGIES
GRAPH
PROCESSING
ANALYTICS
GPU DATABASES
16
GPU-ACCELERATED ARCHITECTURE THEN
Too much data movement and too many different data formats
[Diagram: on the CPU side, APP A and APP B each Read Data and Load Data; on the GPU side, each app holds its own GPU Data; data passes between them through repeated Copy & Convert steps]
[Logos: H2O.ai, Graphistry, BlazingDB, MapD, Simantex, Anaconda, Gunrock, nvGRAPH]
18
APACHE ARROW COMMON DATA LAYER
From Apache Arrow Home Page - https://arrow.apache.org/
19
GPU-ACCELERATED ARCHITECTURE NOW
Single data format and shared access to data on GPU
[Diagram: on the CPU side, Read Data and Load Data; on the GPU side, a single GPU Data Frame in GPU MEM, powered by Apache Arrow, shared by all applications]
[Logos: BlazingDB, MapD, Simantex, H2O.ai, Graphistry, Anaconda, Gunrock, nvGRAPH]
20
GPU DATA FRAME
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train
• 25-100x Improvement
• Less code
• Language flexible
• Primarily In-Memory
GPU/Spark In-Memory Processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
• 5-10x Improvement
• More code
• Language rigid
• Substantially on GPU
End to End GPU Processing (GOAI): Arrow Read → Query → ETL → ML Train
• 25-100x Improvement
• Same code
• Language flexible
• Primarily on GPU
21
GPU OPEN ANALYTICS INITIATIVE
First Project, the GPU Data Frame
No Copy & Converts - Full Interoperability
• GPU Data Frame is the first project of GOAI
• Apache Arrow for GPU
• libgdf: A C library of helper functions, including:
  • Copying the GDF metadata block to the host and parsing it to a host-side struct.
  • Importing/exporting a GDF using the CUDA IPC mechanism.
  • CUDA kernels to perform element-wise math operations on GDF columns.
  • CUDA sort, join, and reduction operations on GDFs.
• pygdf: A Python library for manipulating GDFs
  • Python interface to libgdf library with additional functionality
  • Creating GDFs from Numpy arrays and Pandas DataFrames
  • JIT compilation of group by and filter kernels using Numba
• dask_gdf: Extension for Dask to work with distributed GDFs.
  • Same operations as pygdf, but working on GDFs chunked onto different GPUs and different servers.
  • Will bring the same Kubernetes support that Dask already has.
github.com/gpuopenanalytics
[Diagram: H2O.ai, Numba, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH and Simantex around a shared GPU Data Frame, powered by Apache Arrow]
22
GOAI ECOSYSTEM
GRAPH
PROCESSING
ANALYTICS
GPU DATABASES
Apache Arrow
Powered by:
23
GPU ACCELERATION ACROSS THE ECOSYSTEM
Apache Arrow
H2O.ai
Numba Gunrock
Graphistry
BlazingDB MapD
GPU Data
Frame
nvGRAPH
Apache Arrow
Powered by:
Simantex
24
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train
• 25-100x Improvement
• Less code
• Language flexible
• Primarily In-Memory
GPU/Spark In-Memory Processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
• 5-10x Improvement
• More code
• Language rigid
• Substantially on GPU
End to End GPU Processing (GOAI): Arrow Read → Query → ETL → ML Train
• 25-100x Improvement
• Same code
• Language flexible
• Primarily on GPU
25
PYTHON GPU
DATAFRAME
26
PYGDF @gpuoai
Python GPU DataFrame library
27
PYGDF @gpuoai
Pandas ↔ PyGDF
28
PYGDF @gpuoai
Built-In Functions
29
APACHE SPARK
30
Cluster computing framework
APACHE SPARK
[Diagram labels: Local (Local Code, Spark Context); Cluster (JVM, JVM, JVM)]
31
PYSPARK
Python API for Spark
32
PYSPARK
No cluster execution in Python if using Spark built-ins
[Diagram labels: Local (Local Code, Spark Context); Cluster (JVM, JVM, JVM)]
33
PYSPARK UDFS
When Spark built-ins can’t get the job done alone
• User defined functions (UDFs) allow for creating column-based functions outside of the scope of Spark built-in functions
• UDFs can be defined in Scala/Java or Python and be called from PySpark
• Using Python lambdas in map functions is essentially the same as using a Python UDF
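As a rough illustration of the two routes in the last bullet, the sketch below applies the same Python function once as a column UDF and once through an RDD `map`. The names `plus_one` and `run_both_ways` and the column name `x` are illustrative and not from the talk. PySpark is imported inside the helper so the plain function can be exercised without a cluster.

```python
def plus_one(x):
    """Plain Python logic shared by both routes."""
    return x + 1.0


def run_both_ways(spark):
    """Apply plus_one as a Python UDF and as a lambda in an RDD map.

    `spark` is assumed to be an existing SparkSession. Both routes run
    the function in Python worker processes rather than in the JVM.
    """
    from pyspark.sql.functions import udf  # lazy import: needs PySpark

    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

    # Route 1: column-based Python UDF with a declared return type.
    plus_one_udf = udf(plus_one, "double")
    via_udf = [row[0] for row in df.select(plus_one_udf("x")).collect()]

    # Route 2: a Python lambda inside an RDD map over the same rows.
    via_map = df.rdd.map(lambda row: plus_one(row[0])).collect()

    return via_udf, via_map
```

Either way the rows leave the JVM and are processed by Python workers, which is why the two are treated as equivalent here.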
34
PYSPARK PYTHON UDFS
Python UDFs in PySpark need Python workers and data movement
[Diagram labels: Local (Local Code, Spark Context); Cluster (JVM, JVM, JVM)]
35
PYSPARK PYTHON UDFS
Moving data from the JVM to Python efficiently is hard
[Diagram labels: Local (Local Code, Spark Context); Cluster (JVM, JVM, JVM)]
36
PYSPARK PYTHON UDFS
How is the data movement implemented?
• Rows of data are pickled and sent from the executor JVM process to Python worker processes
• This bottlenecks the data pipeline, but how badly?
• Many people avoid this problem by defining their UDFs in Scala/Java and calling them from PySpark
[Diagram: JVM Executor ⇄ Python Workers, exchanging Rows (Pickle) in both directions]
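The row-by-row movement described above can be modelled in a few lines of standard-library Python. This is a deliberately simplified sketch, not Spark's implementation: Python's `pickle` stands in for both sides of the exchange, rows are pickled one at a time (Spark groups rows into batches before pickling), and the function names are made up for illustration.

```python
import pickle


def python_worker(payloads, func):
    """Stand-in for a Python worker: unpickle a row, apply func, re-pickle."""
    return [pickle.dumps(func(*pickle.loads(p))) for p in payloads]


def run_udf_row_by_row(rows, func):
    """Model the executor side: pickle each row, ship it, unpickle results."""
    outbound = [pickle.dumps(row) for row in rows]   # executor -> worker
    inbound = python_worker(outbound, func)          # worker -> executor
    results = [pickle.loads(p) for p in inbound]
    bytes_moved = sum(map(len, outbound)) + sum(map(len, inbound))
    return results, bytes_moved


rows = [(float(i),) for i in range(5)]
results, bytes_moved = run_udf_row_by_row(rows, lambda x: x + 1)
print(results)
print(bytes_moved, "bytes pickled for", len(rows), "rows")
```

Every value crosses the process boundary twice and is serialized and deserialized on each crossing, so for a function as cheap as `x + 1` the bookkeeping outweighs the arithmetic.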
37
PYSPARK PYTHON UDFS
Performance analysis of a basic UDF
Source: Julien Le Dem, Li Jin: Improving Python and Spark Performance and Interoperability with Apache Arrow
• Almost all of the time is spent serializing and deserializing data as opposed to the actual calculations!
• We can’t actually feed the GPU fast enough to take advantage of the performance benefits!
lambda x: x + 1
38
PYSPARK 2.3
First release with Apache Arrow compatibility!
Apache
Arrow
spark.sql.execution.arrow.enabled → true
39
PYSPARK 2.3 PANDAS
Optimized Spark Data Frame ↔ Pandas Data Frame
df.toPandas()
createDataFrame(pdf)
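A minimal sketch of how the setting and the two conversions fit together is shown below. The helper names `with_arrow` and `round_trip` are illustrative; only the configuration key and the `toPandas()` / `createDataFrame(pdf)` calls come from the slides.

```python
ARROW_CONF = {"spark.sql.execution.arrow.enabled": "true"}


def with_arrow(builder, conf=ARROW_CONF):
    """Apply the Arrow setting to a SparkSession.builder-like object."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def round_trip(spark, pdf):
    """pandas -> Spark -> pandas, using Arrow in both directions when enabled.

    `spark` is assumed to be a SparkSession created through with_arrow(...);
    `pdf` is a pandas DataFrame.
    """
    sdf = spark.createDataFrame(pdf)   # pandas DataFrame -> Spark DataFrame
    return sdf.toPandas()              # Spark DataFrame -> pandas DataFrame
```

With PySpark installed, a session could be created with `with_arrow(SparkSession.builder.appName("arrow-demo")).getOrCreate()`. The key shown is the Spark 2.3 name; Spark 3.0 and later use `spark.sql.execution.arrow.pyspark.enabled` instead.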
40
PYSPARK 2.3 PANDAS UDFS
Vectorized user defined functions using Pandas
Scalar Pandas UDFs
@pandas_udf('double', PandasUDFType.SCALAR)
• Pandas.Series in, Pandas.Series out
• Input and output Series must be the same length
• Output Series must be of the type defined in the decorator
Grouped Map Pandas UDFs
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
• Pandas.DataFrame in, Pandas.DataFrame out
• Output DataFrame can be any length
• Output DataFrame schema defined via a Spark SQL DataFrame schema
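The two UDF flavours can be sketched as follows, using the Spark 2.3-era `PandasUDFType` API named on the slide. The function names, the column names `id` and `v`, and the schema string are assumptions for illustration. PySpark is imported lazily so the pandas logic can be tried on its own.

```python
import pandas as pd


def plus_one(values: pd.Series) -> pd.Series:
    """Scalar logic: a Series in, a Series of the same length out."""
    return values + 1.0


def subtract_mean(group: pd.DataFrame) -> pd.DataFrame:
    """Grouped-map logic: one group's rows in, a DataFrame out."""
    return group.assign(v=group["v"] - group["v"].mean())


def apply_pandas_udfs(df):
    """Wrap both functions for a Spark DataFrame with columns id and v."""
    from pyspark.sql.functions import pandas_udf, PandasUDFType  # lazy import

    scalar_udf = pandas_udf(plus_one, "double", PandasUDFType.SCALAR)
    grouped_udf = pandas_udf(
        subtract_mean, "id long, v double", PandasUDFType.GROUPED_MAP
    )
    scalar_result = df.select(scalar_udf(df["v"]))
    grouped_result = df.groupby("id").apply(grouped_udf)
    return scalar_result, grouped_result
```

Keeping the pandas logic in plain functions makes it easy to check locally before it is wrapped and shipped to executors.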
41
PYSPARK 2.3 PANDAS UDFS
PySpark data movement performance issues resolved
• Data is converted from rows to Apache Arrow columnar record batches within the executor JVM processes
• Data does not have to be serialized or deserialized!
[Diagram: JVM Executor ⇄ Python Workers, exchanging Apache Arrow Columnar Record Batches in both directions]
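To make the row-to-columnar idea concrete, the sketch below pivots rows into one contiguous buffer per column and lets the "worker" operate on a whole buffer at once. The standard-library `array` module is only a stand-in here: real Arrow record batches also carry a schema and validity bitmaps, and the function names are illustrative.

```python
from array import array


def to_column_batch(rows):
    """Pivot (id, value) rows into one contiguous buffer per column."""
    ids = array("q", (r[0] for r in rows))      # 64-bit signed integers
    values = array("d", (r[1] for r in rows))   # 64-bit floats
    return {"id": ids.tobytes(), "value": values.tobytes()}


def plus_one_on_batch(batch):
    """Worker side: view the value buffer as doubles and add 1 to each."""
    values = array("d")
    values.frombytes(batch["value"])
    return array("d", (v + 1.0 for v in values)).tobytes()


rows = [(i, float(i)) for i in range(4)]
batch = to_column_batch(rows)
result = array("d")
result.frombytes(plus_one_on_batch(batch))
print(list(result))
print(len(batch["id"]) + len(batch["value"]), "bytes for", len(rows), "rows")
```

The whole column moves as a single fixed-width buffer, so there is no per-row encoding step on either side of the boundary.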
42
PYSPARK 2.3 PANDAS UDFS
No more serialization and deserialization overhead!
Source: Julien Le Dem, Li Jin: Improving Python and Spark Performance and Interoperability with Apache Arrow
• With the data movement performance issues resolved, the bottleneck for many UDFs gets pushed back to the compute
• We can utilize GPUs to help in this respect!
lambda x: x + 1
43
APACHE SPARK
WITH PYGDF
44
PANDAS UDFS WITH GPUS
Pandas ↔ PyGDF makes this easy!
Scalar Pandas UDFs
@pandas_udf('double', PandasUDFType.SCALAR)
Pandas.Series ↔ PyGDF.Series
Grouped Map Pandas UDFs
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
Pandas.DataFrame ↔ PyGDF.DataFrame
45
PANDAS UDFS WITH GPUS
What about for more advanced operations?
• Many UDFs are created because the function can’t be easily created using Spark primitives
• Probably can’t be created with PyGDF primitives either
• Writing low level code and tying it into your UDF is a non-starter
46
PANDAS UDFS WITH GPUS
Numba to the rescue!
• Luckily, PyGDF has convenience functions for Numba to JIT compile CUDA kernels for optimized execution on the GPU
• DataFrame.apply_rows()
• Series.applymap()
• UDFs within UDFs!
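A hedged sketch of the "UDFs within UDFs" pattern follows. The kernel is ordinary Python that Numba would compile for the GPU. The `apply_rows` call (with `incols`, `outcols`, `kwargs`) and the `from_pandas` / `to_pandas` conversions mirror the signatures documented for cuDF, PyGDF's successor, and are assumed to match the PyGDF version used in the talk. The names `plus_one_kernel` and `plus_one_on_gpu` are illustrative.

```python
def plus_one_kernel(x, out):
    """Row-wise kernel body: out[i] = x[i] + 1 for each row in this chunk."""
    for i, value in enumerate(x):
        out[i] = value + 1.0


def plus_one_on_gpu(values):
    """Body of a Scalar Pandas UDF that does its arithmetic on the GPU.

    `values` is a pandas Series. Requires pygdf, Numba and a CUDA GPU;
    the pygdf calls below are assumptions (see the note above).
    """
    import numpy as np
    import pygdf  # lazy import: only available on GPU workers

    gdf = pygdf.DataFrame.from_pandas(values.to_frame("x"))
    gdf = gdf.apply_rows(
        plus_one_kernel,
        incols=["x"],                  # kernel argument names match columns
        outcols={"out": np.float64},   # output column name and dtype
        kwargs={},
    )
    return gdf.to_pandas()["out"]
```

Because the kernel body is plain Python over array-likes, it can be checked on the CPU with ordinary lists before it is handed to Numba.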
47
PANDAS UDFS WITH GPUS
Numba GPU-Accelerated PyGDF UDFs in Pandas UDFs
Scalar Pandas UDFs
@pandas_udf('double', PandasUDFType.SCALAR)
Pandas.Series ↔ PyGDF.Series
Grouped Map Pandas UDFs
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
Pandas.DataFrame ↔ PyGDF.DataFrame
48
LESSONS LEARNED
GPU-Accelerated UDFs are hard to do right
• Data needs to be large enough to utilize the GPU effectively, but not so large that it exhausts GPU memory (1e6 – 9e9)
• The work done on the GPU needs to be substantial enough to prevent data transfer from dominating execution time
  • E.g. group by a timestamp and run a Grouped Map Pandas UDF of GPU-accelerated pagerank per group
• PyGDF depends on Arrow 0.7.1 for now while PySpark uses Arrow 0.8+; WIP to update the dependency
  • https://github.com/kkraus14/libgdf/tree/temp_remove_ipc_arrow for a temporary workaround
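One way to act on the first lesson is a simple size guard in front of the GPU path. The sketch below treats the slide's "(1e6 – 9e9)" range as element counts, which is an assumption since the slide does not state the unit. The function names and the CPU fallback are illustrative.

```python
def should_offload(n_elements, min_elements=1_000_000, max_elements=9_000_000_000):
    """True when a batch falls inside the size window quoted on the slide.

    The bounds are read as element counts (an assumption). Too small and
    transfer overhead dominates; too large and GPU memory may run out.
    """
    return min_elements <= n_elements <= max_elements


def choose_path(n_elements):
    """Pick an execution path for a batch of the given size."""
    return "gpu" if should_offload(n_elements) else "cpu"


for n in (10, 5_000_000, 10**10):
    print(n, "->", choose_path(n))
```

In practice the bounds would need tuning for the GPU's memory and for how much work the kernel does per element.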
49
FUTURE
50
PYGDF AND LIBGDF
• Optimized join performance
• GDF Graph Analytics Library
• Support for multiple interconnected GPUs in LibGDF and PyGDF (same PCIe root or NVLink)
• General performance improvements across the board
TPCH Query 21 – End to End Results Using 32-bit Keys*
TIME (MS)              SF1   SF10   SF100
CPU (single-threaded)  1329  31731  465064
V100 (PCIe3)           22    164    1521
V100 (3xNVLINK2)       12    45     466
Chart annotations: 3.2x, 300x
TPCH Query 4 – End to End Results Using 32-bit Keys*
TIME (MS)              SF1   SF10   SF100
CPU (single-threaded)  150   2041   24960
V100 (PCIe3)           13    105    946
V100 (3xNVLINK2)       7     23     308
Chart annotations: 3.1x, 26x
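The speedups implied by the two tables can be recomputed directly from the timings. The sketch below copies the numbers from the slide; the helper name `speedup` and the choice to compare at SF100 are illustrative.

```python
# Times in milliseconds copied from the two tables (columns SF1, SF10, SF100).
TIMES_MS = {
    "Query 21": {
        "CPU (single-threaded)": (1329, 31731, 465064),
        "V100 (PCIe3)": (22, 164, 1521),
        "V100 (3xNVLINK2)": (12, 45, 466),
    },
    "Query 4": {
        "CPU (single-threaded)": (150, 2041, 24960),
        "V100 (PCIe3)": (13, 105, 946),
        "V100 (3xNVLINK2)": (7, 23, 308),
    },
}
SCALE_FACTORS = ("SF1", "SF10", "SF100")


def speedup(query, baseline, candidate, scale_factor):
    """How many times faster `candidate` is than `baseline`."""
    i = SCALE_FACTORS.index(scale_factor)
    times = TIMES_MS[query]
    return times[baseline][i] / times[candidate][i]


for query in TIMES_MS:
    cpu_vs_pcie = speedup(query, "CPU (single-threaded)", "V100 (PCIe3)", "SF100")
    pcie_vs_nvlink = speedup(query, "V100 (PCIe3)", "V100 (3xNVLINK2)", "SF100")
    print(f"{query}: {cpu_vs_pcie:.1f}x CPU->PCIe3, {pcie_vs_nvlink:.1f}x PCIe3->NVLink")
```

At SF100 the CPU-to-PCIe3 and PCIe3-to-NVLink ratios land close to the rounded figures annotated on the slide (300x and 3.2x for Query 21, 26x and 3.1x for Query 4).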
51
NUMBA AND CUPY
Standard Python GPU N-Dimensional Array
• Numba and CuPy are unifying their GPU backends to share an n-dimensional array implementation
• Hoping to get additional Python libraries like PyCUDA, PyTorch, etc. to unify as well in the future
52
DASK.GDF AND DASK.CUPY
Scale out in addition to scaling up
• Use Dask as the scale out method for distributed GPU data structures
• Extend Dask’s Kubernetes integration as needed to support the full extent of GPU integration
• Dask.GDF is in the very early stages of development: https://github.com/gpuopenanalytics/dask_gdf
• Dask.CuPy has not started yet, but if interested we’re hiring!
53
SPARK 2.3+ WISHES
More Arrow-based Pandas UDF types
Partition Pandas UDFs
@pandas_udf(schema, PandasUDFType.PARTITION)
• Pandas.DataFrame in, Pandas.DataFrame out
• Output DataFrame can be any length
• Output DataFrame schema defined via a Spark SQL DataFrame schema
54
SPARK 2.3+ WISHES
Arrow as the primary data format for Spark DataFrame
• Currently Spark can take advantage of columnar file formats and columnar data connections by loading the necessary columns and pushing down predicates
• Most typical operations benefit from columnar data structure
• Using Arrow will allow for optimized compute kernels and reduce the JVM dependency in the future
• Eventually native GPU acceleration
55
JOIN THE REVOLUTION
Everyone Can Help!
Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!
APACHE ARROW: https://arrow.apache.org/ @ApacheArrow
GPU Open Analytics Initiative: http://gpuopenanalytics.com/ @Gpuoai
Joshua Patterson @datametrician
Keith Kraus @keithjkraus
QUESTIONS?

GPU-Accelerating UDFs in PySpark with Numba and PyGDF

  • 1. GPU-ACCELERATING UDFS IN PYSPARK WITH NUMBA AND PYGDF Joshua Patterson @datametrician Keith Kraus @keithjkraus
  • 3. DATA DELUGE TO INSIGHT HUNGRY: Increasing data variety. [Figure: data sources plotted against data volume (gigabytes, terabytes, petabytes, exabytes, zettabytes) in four bands: Business Process, Web, Digital, AI. Sources shown: Search Marketing, Behavioral Targeting, Dynamic Funnels, User Generated Content, Mobile Web, SMS/MMS, Sentiment, HD Video, Speech To Text, Product/Service Logs, Social Network, Business Data Feeds, User Click Stream, Sensors, Infotainment Systems, Wearable Devices, Cyber Security Logs, Connected Vehicles, Machine Data, IoT Data, Dynamic Pricing, Payment Record, Purchase Detail, Purchase Record, Support Contacts, Segmentation, Offer Details, Web Logs, Offer History, A/B Testing, Streaming Video, Natural Language Processing.]
  • 4. DATA FORMATS: Avro, XML, JSON, GML, ProtoBuf, HDFS, Pickle, CSV, Parquet, Pandas, NumPy, CSR, COO, CSC (not a complete list). Plain text vs. binary; compressed vs. uncompressed.
  • 5. DATA PROCESSING EVOLUTION: Faster data access, less data movement.
      • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
  • 6. DATA PROCESSING EVOLUTION: Faster data access, less data movement.
      • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
      • Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
  • 7. APACHE SPARK: Cluster computing framework.
      • Spark has almost become synonymous with Hadoop and Big Data
      • Integrates with nearly the entire Big Data ecosystem
      • The processing layer for big data and the leading ML framework
      • Five main components: RDD API, SQL, Streaming, MLlib, and GraphX
  • 8. SPARK IS NOT ENOUGH: Basic workloads are bottlenecked by the CPU.
      • In a simple benchmark consisting of aggregating data, the CPU is the bottleneck
      • This is after the data is parsed and cached into memory, which is another common bottleneck
      • The CPU bottleneck is even worse in more complex workloads!
      • Benchmark query: SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type;
      • Source: Mark Litwintschik's blog, "1.1 Billion Taxi Rides: EC2 versus EMR"
  • 9. SPARK ECOSYSTEM: Lacks full GPU integration.
      • 4 core parts: SQL, Streaming (Spark functions micro batched), Machine Learning, and Graph
      • Spark is currently optimizing its existing code base and adding more usability, not GPU support yet
  • 10. SPARK ECOSYSTEM: GPU acceleration possible but not ideal.
      • Using Numba, the Microsoft Azure team released a basic example showing a ~5x speedup using GPUs with Spark
      • This example is extremely limited in that they're not passing any real data to the Python process or the GPU
      • When passing data from Spark to the GPU, there are new issues and performance considerations
      • Source: https://github.com/Azure/aztk/blob/master/node_scripts/jupyter-samples/GPU%2Bvs%2BCPU%2Busing%2BNumba.ipynb
  • 12. GPUS ARE FAST: 1.1 Billion Taxi Ride Benchmark. Query time in milliseconds:
          Query      MapD DGX-1   MapD 4 x P100   Redshift 6-node   Spark 11-node
          Query 1    21           30              1560              10190
          Query 2    80           99              1250              8134
          Query 3    150          269             2250              19624
          Query 4    372          696             2970              85942
      • Source: MapD benchmarks on DGX from internal NVIDIA testing, following the guidelines of Mark Litwintschik's blogs (@marklit82): Redshift, 6-node ds2.8xlarge cluster, and Spark 2.1, 11 x m3.xlarge cluster with HDFS
  • 13. GPUS ARE FAST: K-Means Benchmark. 10 with latest solver
  • 14. DATA PROCESSING EVOLUTION: Faster data access, less data movement.
      • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
      • Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
      • GPU/Spark in-memory processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU)
  • 16. GPU-ACCELERATED ARCHITECTURE THEN: Too much data movement and too many different data formats. [Diagram: App A and App B each read and load data on the CPU, then copy and convert it into their own separate GPU data. Projects shown: H2O.ai, Graphistry, BlazingDB, MapD, Simantex, Anaconda, Gunrock, nvGRAPH.]
  • 17. GPU-ACCELERATED ARCHITECTURE THEN: Too much data movement and too many different data formats. [Same diagram as slide 16, with the project logos rearranged.]
  • 18. APACHE ARROW COMMON DATA LAYER. From the Apache Arrow home page: https://arrow.apache.org/
  • 19. GPU-ACCELERATED ARCHITECTURE NOW: Single data format and shared access to data on GPU. [Diagram: data is read on the CPU and loaded into GPU memory as a GPU Data Frame, powered by Apache Arrow, which is shared by H2O.ai, Graphistry, BlazingDB, MapD, Simantex, Anaconda, Gunrock, and nvGRAPH.]
  • 20. GPU DATA FRAME: Faster data access, less data movement.
      • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
      • Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
      • GPU/Spark in-memory processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU)
      • End-to-end GPU processing (GOAI): Arrow Read → Query → ETL → ML Train (25-100x improvement, same code, language flexible, primarily on GPU)
  • 21. GPU OPEN ANALYTICS INITIATIVE: First project, the GPU Data Frame. No copy and converts, full interoperability. github.com/gpuopenanalytics
      • GPU Data Frame (GDF) is the first project of GOAI
      • Apache Arrow for GPU
      • libgdf: A C library of helper functions, including:
          • Copying the GDF metadata block to the host and parsing it to a host-side struct
          • Importing/exporting a GDF using the CUDA IPC mechanism
          • CUDA kernels to perform element-wise math operations on GDF columns
          • CUDA sort, join, and reduction operations on GDFs
      • pygdf: A Python library for manipulating GDFs
          • Python interface to the libgdf library with additional functionality
          • Creating GDFs from NumPy arrays and Pandas DataFrames
          • JIT compilation of group by and filter kernels using Numba
      • dask_gdf: Extension for Dask to work with distributed GDFs
          • Same operations as pygdf, but working on GDFs chunked onto different GPUs and different servers
          • Will bring the same Kubernetes support that Dask already has
      • [Diagram: H2O.ai, Numba, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH, and Simantex sharing the GPU Data Frame, powered by Apache Arrow.]
  • 23. GPU ACCELERATION ACROSS THE ECOSYSTEM. [Diagram: H2O.ai, Numba, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH, and Simantex around the GPU Data Frame, powered by Apache Arrow.]
  • 24. DATA PROCESSING EVOLUTION: Faster data access, less data movement.
      • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
      • Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
      • GPU/Spark in-memory processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU)
      • End-to-end GPU processing (GOAI): Arrow Read → Query → ETL → ML Train (25-100x improvement, same code, language flexible, primarily on GPU)
  • 26. PYGDF: Python GPU DataFrame library. @gpuoai
  • 30. APACHE SPARK: Cluster computing framework. [Diagram: local code drives a Spark Context backed by a local JVM, which coordinates JVMs on the cluster.]
  • 32. PYSPARK: No cluster execution in Python if using Spark built-ins. [Diagram: local code drives a Spark Context backed by a local JVM, which coordinates JVMs on the cluster.]
  • 33. PYSPARK UDFS: When Spark built-ins can't get the job done alone.
      • User defined functions (UDFs) allow for creating column-based functions outside of the scope of Spark built-in functions
      • UDFs can be defined in Scala/Java or Python and be called from PySpark
      • Using Python lambdas in map functions is essentially the same as using a Python UDF
  • 34. PYSPARK PYTHON UDFS: Python UDFs in PySpark need Python workers and data movement. [Diagram: local code drives a Spark Context backed by a local JVM, which coordinates JVMs on the cluster.]
  • 35. PYSPARK PYTHON UDFS: Moving data from the JVM to Python efficiently is hard. [Same diagram as slide 34.]
  • 36. PYSPARK PYTHON UDFS: How is the data movement implemented?
      • Rows of data are pickled and sent from the executor JVM process to Python worker processes
      • This bottlenecks the data pipeline, but how badly?
      • Many people avoid this problem by defining their UDFs in Scala/Java and calling them from PySpark
      • [Diagram: the JVM executor and Python workers exchange pickled rows in both directions.]
  • 37. PYSPARK PYTHON UDFS: Performance analysis of a basic UDF.
      • Example UDF: lambda x: x + 1
      • Almost all of the time is spent serializing and deserializing data as opposed to the actual calculations!
      • We can't actually feed the GPU fast enough to take advantage of the performance benefits!
      • Source: Julien Le Dem, Li Jin: "Improving Python and Spark Performance and Interoperability with Apache Arrow"
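To make the serialization cost concrete, here is a toy model that uses only the Python standard library. It is not Spark's actual transport: the function names and the use of `array` as a stand-in for a columnar buffer are illustrative assumptions. One function pickles every row on its own before applying `x + 1`; the other moves the whole column as a single contiguous buffer.

```python
import pickle
import time
from array import array


def plus_one_row_pickled(values):
    """Serialize and deserialize every row separately, then apply x + 1."""
    out = []
    for v in values:
        payload = pickle.dumps((v,))      # one serialized message per row
        (x,) = pickle.loads(payload)
        out.append(x + 1)
    return out


def plus_one_columnar(values):
    """Move the whole column as one contiguous buffer, then apply x + 1."""
    payload = array("q", values).tobytes()   # signed 64-bit integers, one buffer
    column = array("q")
    column.frombytes(payload)
    return [x + 1 for x in column]


def time_it(fn, values):
    """Return wall-clock seconds for a single call of fn(values)."""
    start = time.perf_counter()
    fn(values)
    return time.perf_counter() - start
```

Calling `time_it` on both functions with a few hundred thousand integers should show the per-row path spending most of its time in `pickle`, which is the effect the slide describes; the exact ratio depends on the machine.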
  • 38. PYSPARK 2.3: First release with Apache Arrow compatibility! Set spark.sql.execution.arrow.enabled → true
  • 39. PYSPARK 2.3 PANDAS: Optimized Spark DataFrame ↔ Pandas DataFrame conversion through df.toPandas() and createDataFrame(pdf)
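A minimal sketch of how the Arrow setting and the two conversion calls fit together. The helper names `build_session` and `roundtrip` and the application name are hypothetical, and PySpark is imported lazily so the file loads on a machine without Spark installed.

```python
# Setting named on the slide for Spark 2.3. Spark 3.x renamed it to
# spark.sql.execution.arrow.pyspark.enabled.
ARROW_CONF = {"spark.sql.execution.arrow.enabled": "true"}


def build_session(app_name="arrow-demo"):
    """Create a SparkSession with Arrow-based conversion enabled.

    Requires pyspark and pyarrow to be installed.
    """
    from pyspark.sql import SparkSession  # lazy import

    builder = SparkSession.builder.appName(app_name)
    for key, value in ARROW_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def roundtrip(spark, pdf):
    """Pandas -> Spark -> Pandas; both directions use Arrow when enabled."""
    sdf = spark.createDataFrame(pdf)   # Pandas DataFrame to Spark DataFrame
    return sdf.toPandas()              # Spark DataFrame back to Pandas
```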
  • 40. PYSPARK 2.3 PANDAS UDFS: Vectorized user defined functions using Pandas.
      • Scalar Pandas UDFs: @pandas_udf('double', PandasUDFType.SCALAR)
          • Pandas.Series in, Pandas.Series out
          • Input and output Series must be the same length
          • Output Series must be of the type defined in the decorator
      • Grouped Map Pandas UDFs: @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
          • Pandas.DataFrame in, Pandas.DataFrame out
          • Output DataFrame can be any length
          • Output DataFrame schema defined via a Spark SQL DataFrame schema
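The two UDF types can be sketched as plain functions that are wrapped with `pandas_udf` afterwards. The function names, the column names `id` and `v`, and the schema string are assumptions made for illustration; PySpark is imported lazily so the plain functions can be tried on their own.

```python
def plus_one(series):
    """Scalar UDF body: Series in, Series of the same length out."""
    return series + 1


def subtract_group_mean(pdf):
    """Grouped map UDF body: DataFrame in, DataFrame out (one call per group)."""
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def make_udfs():
    """Wrap the plain functions as Pandas UDFs (requires pyspark and pyarrow)."""
    from pyspark.sql.functions import pandas_udf, PandasUDFType  # lazy import

    scalar = pandas_udf(plus_one, "double", PandasUDFType.SCALAR)
    # Newer Spark releases prefer groupby(...).applyInPandas for grouped maps.
    grouped = pandas_udf(
        subtract_group_mean, "id long, v double", PandasUDFType.GROUPED_MAP
    )
    return scalar, grouped


# Usage on a Spark DataFrame `df` with columns id (long) and v (double):
#   scalar, grouped = make_udfs()
#   df.withColumn("v_plus_one", scalar(df["v"]))
#   df.groupby("id").apply(grouped)
```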
  • 41. PYSPARK 2.3 PANDAS UDFS: PySpark data movement performance issues resolved.
      • Data is converted from rows to Apache Arrow columnar record batches within the executor JVM processes
      • Data does not have to be serialized or deserialized!
      • [Diagram: the JVM executor and Python workers exchange Apache Arrow columnar record batches.]
  • 42. PYSPARK 2.3 PANDAS UDFS: No more serialization and deserialization overhead!
      • Example UDF: lambda x: x + 1
      • With the data movement performance issues resolved, the bottleneck for many UDFs gets pushed back to the compute
      • We can utilize GPUs to help in this respect!
      • Source: Julien Le Dem, Li Jin: "Improving Python and Spark Performance and Interoperability with Apache Arrow"
  • 44. PANDAS UDFS WITH GPUS: Pandas ↔ PyGDF makes this easy!
      • Scalar Pandas UDFs, @pandas_udf('double', PandasUDFType.SCALAR): Pandas.Series ↔ PyGDF.Series
      • Grouped Map Pandas UDFs, @pandas_udf(schema, PandasUDFType.GROUPED_MAP): Pandas.DataFrame ↔ PyGDF.DataFrame
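The pattern on this slide is convert in, compute on the GPU container, convert out. A library-agnostic sketch follows; `on_device` is a hypothetical helper, and the converters are passed in rather than hard-coded because the exact PyGDF conversion calls are not shown in the deck. With PyGDF they would presumably be something like `pygdf.DataFrame.from_pandas` and the frame's `to_pandas()` method, which is an assumption to verify against the installed version.

```python
def on_device(compute, to_device=lambda x: x, to_host=lambda x: x):
    """Wrap `compute` so it receives device data and returns host data.

    to_device: converts the Pandas input to a GPU container
    to_host:   converts the GPU result back to Pandas
    Both default to the identity so the wrapper can be exercised without a GPU.
    """
    def wrapper(host_data):
        device_data = to_device(host_data)      # e.g. Pandas -> PyGDF
        device_result = compute(device_data)    # work done on the GPU container
        return to_host(device_result)           # e.g. PyGDF -> Pandas
    return wrapper
```

The wrapped function has the Pandas-in, Pandas-out shape a Pandas UDF expects, so it can be handed to `pandas_udf` in place of a pure Pandas function.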
  • 45. PANDAS UDFS WITH GPUS: What about more advanced operations?
      • Many UDFs are created because the function can't be easily created using Spark primitives
      • Probably can't be created with PyGDF primitives either
      • Writing low-level code and tying it into your UDF is a non-starter
  • 46. PANDAS UDFS WITH GPUS: Numba to the rescue!
      • Luckily, PyGDF has convenience functions for Numba to JIT compile CUDA kernels for optimized execution on the GPU
          • DataFrame.apply_rows()
          • Series.applymap()
      • UDFs within UDFs!
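For a sense of what Numba compiles under the hood, here is a hand-written Numba CUDA kernel for the `x + 1` example. This is a raw `numba.cuda` sketch, not PyGDF's `apply_rows` or `applymap` wrappers, and the helper names are hypothetical. Running `plus_one_on_gpu` requires Numba, NumPy, and a CUDA-capable GPU.

```python
def blocks_for(n_items, threads_per_block=128):
    """Number of thread blocks needed to cover n_items (ceiling division)."""
    return (n_items + threads_per_block - 1) // threads_per_block


def plus_one_on_gpu(host_array, threads_per_block=128):
    """Copy a 1-D NumPy array to the GPU, add 1.0 to each element, copy back."""
    from numba import cuda  # lazy import: needs Numba and a CUDA-capable GPU

    @cuda.jit
    def plus_one_kernel(inp, out):
        i = cuda.grid(1)          # global thread index
        if i < inp.size:          # guard threads past the end of the array
            out[i] = inp[i] + 1.0

    d_in = cuda.to_device(host_array)
    d_out = cuda.device_array_like(d_in)
    n_blocks = blocks_for(host_array.size, threads_per_block)
    plus_one_kernel[n_blocks, threads_per_block](d_in, d_out)
    return d_out.copy_to_host()
```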
  • 47. PANDAS UDFS WITH GPUS: Numba GPU-accelerated PyGDF UDFs in Pandas UDFs.
      • Scalar Pandas UDFs, @pandas_udf('double', PandasUDFType.SCALAR): Pandas.Series ↔ PyGDF.Series
      • Grouped Map Pandas UDFs, @pandas_udf(schema, PandasUDFType.GROUPED_MAP): Pandas.DataFrame ↔ PyGDF.DataFrame
  • 48. LESSONS LEARNED: GPU-accelerated UDFs are hard to do right.
      • Data needs to be large enough to utilize the GPU effectively, but not so large that it exhausts GPU memory (1e6 – 9e9)
      • The work done on the GPU needs to be substantial enough to prevent data transfer from dominating execution time
          • For example: group by a timestamp and run a Grouped Map Pandas UDF of GPU-accelerated PageRank per group
      • PyGDF depends on Arrow 0.7.1 for now while PySpark uses Arrow 0.8+; updating the dependency is a work in progress
          • Temporary workaround: https://github.com/kkraus14/libgdf/tree/temp_remove_ipc_arrow
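The second lesson can be read as simple arithmetic: offloading pays off only when GPU compute time plus the transfer time in both directions is below the CPU time. The sketch below is a back-of-the-envelope model with hypothetical function names; every number fed into it is an input supplied by the reader, not a measurement from the talk.

```python
def offload_pays_off(n_bytes, cpu_seconds, gpu_seconds, bytes_per_second):
    """True when GPU compute plus host<->device transfer beats the CPU time."""
    transfer_seconds = 2 * n_bytes / bytes_per_second   # copy in and copy out
    return gpu_seconds + transfer_seconds < cpu_seconds


def transfer_share(n_bytes, gpu_seconds, bytes_per_second):
    """Fraction of the total GPU-path time that is spent moving data."""
    transfer_seconds = 2 * n_bytes / bytes_per_second
    return transfer_seconds / (transfer_seconds + gpu_seconds)
```

A `transfer_share` close to 1.0 signals that the UDF does too little work per byte, which is the situation a heavier per-group computation such as PageRank avoids.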
  • 50. PYGDF AND LIBGDF
      • Optimized join performance
      • GDF Graph Analytics Library
      • Support for multiple interconnected GPUs in LibGDF and PyGDF (same PCIe root or NVLink)
      • General performance improvements across the board
      • TPCH Query 21, end-to-end results using 32-bit keys*, time in ms (chart callouts: 3.2x, 300x):
          System                   SF1     SF10     SF100
          CPU (single-threaded)    1329    31731    465064
          V100 (PCIe3)             22      164      1521
          V100 (3xNVLINK2)         12      45       466
      • TPCH Query 4, end-to-end results using 32-bit keys*, time in ms (chart callouts: 3.1x, 26x):
          System                   SF1     SF10     SF100
          CPU (single-threaded)    150     2041     24960
          V100 (PCIe3)             13      105      946
          V100 (3xNVLINK2)         7       23       308
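The speedup callouts on this slide appear to be ratios taken from the SF100 column; that reading is an assumption, since the deck does not say how they were derived. The short script below recomputes the ratios from the published times so they can be compared with the callouts.

```python
# Times in milliseconds, copied from the slide (scale factor 100 column).
SF100_MS = {
    "Query 21": {"cpu": 465064, "v100_pcie3": 1521, "v100_nvlink2": 466},
    "Query 4": {"cpu": 24960, "v100_pcie3": 946, "v100_nvlink2": 308},
}


def speedups(times):
    """Return (CPU vs. PCIe3 V100, PCIe3 V100 vs. NVLink V100) ratios."""
    cpu_vs_gpu = times["cpu"] / times["v100_pcie3"]
    pcie_vs_nvlink = times["v100_pcie3"] / times["v100_nvlink2"]
    return cpu_vs_gpu, pcie_vs_nvlink


for query, times in SF100_MS.items():
    cpu_vs_gpu, pcie_vs_nvlink = speedups(times)
    print(f"{query}: CPU/PCIe3 = {cpu_vs_gpu:.2f}x, PCIe3/NVLink = {pcie_vs_nvlink:.2f}x")
```

Under that reading, the ratios land close to the rounded figures on the slide: about 306x and 3.26x for Query 21, and about 26.4x and 3.07x for Query 4.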
  • 51. NUMBA AND CUPY: Standard Python GPU N-dimensional array.
      • Numba and CuPy are unifying their GPU backends to share an n-dimensional array implementation
      • Hoping to get additional Python libraries like PyCUDA, PyTorch, etc. to unify as well in the future
  • 52. DASK.GDF AND DASK.CUPY: Scale out in addition to scaling up.
      • Use Dask as the scale-out method for distributed GPU data structures
      • Extend Dask's Kubernetes integration as needed to support the full extent of GPU integration
      • Dask.GDF is in the very early stages of development: https://github.com/gpuopenanalytics/dask_gdf
      • Dask.CuPy has not started yet, but if interested we're hiring!
  • 53. SPARK 2.3+ WISHES: More Arrow-based Pandas UDF types.
      • Partition Pandas UDFs (wished for): @pandas_udf(schema, PandasUDFType.PARTITION)
          • Pandas.DataFrame in, Pandas.DataFrame out
          • Output DataFrame can be any length
          • Output DataFrame schema defined via a Spark SQL DataFrame schema
  • 54. SPARK 2.3+ WISHES: Arrow as the primary data format for Spark DataFrame.
      • Currently Spark can take advantage of columnar file formats and columnar data connections by loading the necessary columns and pushing down predicates
      • Most typical operations benefit from a columnar data structure
      • Using Arrow will allow for optimized compute kernels and reduce the JVM dependency in the future
      • Eventually native GPU acceleration
  • 55. JOIN THE REVOLUTION: Everyone can help! Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!
      • Apache Arrow: https://arrow.apache.org/ @ApacheArrow
      • GPU Open Analytics Initiative: http://gpuopenanalytics.com/ @Gpuoai
  • 56. QUESTIONS? Joshua Patterson @datametrician, Keith Kraus @keithjkraus