RAPIDS – Open GPU-accelerated Data Science
RAPIDS is an initiative driven by NVIDIA to accelerate the complete end-to-end data science ecosystem with GPUs. It consists of several open source projects that expose familiar interfaces making it easy to accelerate the entire data science pipeline- from the ETL and data wrangling to feature engineering, statistical modeling, machine learning, and graph analysis.
Corey J. Nolet
Corey has a passion for understanding the world through the analysis of data. He is a developer on the RAPIDS open source project focused on accelerating machine learning algorithms with GPUs.
Adam Thompson
Adam Thompson is a Senior Solutions Architect at NVIDIA. With a background in signal processing, he has spent his career participating in and leading programs focused on deep learning for RF classification, data compression, high-performance computing, and managing and designing applications targeting large collection frameworks. His research interests include deep learning, high-performance computing, systems engineering, cloud architecture/integration, and statistical signal processing. He holds a Masters degree in Electrical & Computer Engineering from Georgia Tech and a Bachelors from Clemson University.
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
RAPIDS – Open GPU-accelerated Data Science
1. Adam Thompson | adamt@nvidia.com | Senior Solutions Architect
Corey Nolet | cnolet@nvidia.com | Data Scientist and Senior Engineer
GPU Enabled Data Science
2. 2
About Us
Adam Thompson @adamlikesai
Senior Solutions Architect at NVIDIA, Signals/RF SME
Research interests include applications of machine and deep learning to non-
traditional data types, HPC, and large system integration
Experience in programming GPUs for signal processing tasks since CUDA Toolkit 2
Roots for and
Corey Nolet @cjnolet
Data Scientist & Senior Engineer on the RAPIDS cuML team at NVIDIA
Focuses on building and scaling machine learning algorithms to support extreme
data loads at light-speed
Over a decade experience building massive-scale exploratory data science & real-
time analytics platforms for HPC environments in the defense industry
Working towards PhD in Computer Science, focused on unsupervised ML
3. 3
1980 1990 2000 2010 2020
102
103
104
105
106
107
Single-threaded perf
1.5X per year
1.1X per year
GPU-Computing perf
1.5X per year 1000X
By 2025
REALITIES OF PERFORMANCE
GPU Performance Grows
5. 5
ML Workflow Stifles Innovation
It Requires Exploration and Iterations
All
Data
ETL
Manage Data
Structured
Data Store
Data
Preparation
Training
Model
Training
Visualization
Evaluate
Inference
Deploy
Accelerating just `Model Training` does have benefit but doesn’t address the whole problem
End-to-End acceleration is needed
Iterate …
Cross Validate …
Grid Search …
Iterate some more.
7. 7
GPU
Adoption
Barriers
• Too much data movement
• Too many makeshift data formats
• Writing CUDA C/C++ is hard
• No Python API for data manipulation
Yes GPUs are fast but …
8. 8
APP A
DATA MOVEMENT AND TRANSFORMATION
The bane of productivity and performance
CPU GPU
APP B
Read Data
Copy & Convert
Copy & Convert
Copy & Convert
Load Data
APP A GPU
Data
APP B
GPU
Data
APP A
APP B
9. 9
APP A
DATA MOVEMENT AND TRANSFORMATION
What if we could keep data on the GPU?
APP B
Copy & Convert
Copy & Convert
Copy & Convert
APP A GPU
Data
APP B
GPU
Data
Read Data
Load Data
APP B
CPU GPU
APP A
11. 11
Data Processing Evolution
Faster Data Access Less Data Movement
25-100x Improvement
Less code
Language flexible
Primarily In-Memory
HDFS
Read
HDFS
Write
HDFS
Read
HDFS
Write
HDFS
ReadQuery ETL ML Train
HDFS
Read Query ETL ML Train
HDFS
Read
GPU
Read
Query
CPU
Write
GPU
Read
ETL
CPU
Write
GPU
Read
ML
Train
Arrow
Read
Query ETL
ML
Train
5-10x Improvement
More code
Language rigid
Substantially on GPU
50-100x Improvement
Same code
Language flexible
Primarily on GPU
RAPIDS
GPU/Spark In-Memory Processing
Hadoop Processing, Reading from disk
Spark In-Memory Processing
13. 13
cuDF
Analytics
GPU Memory
Data Preparation VisualizationModel Training
cuML
Machine Learning
cuGraph
Graph Analytics
PyTorch & Chainer
Deep Learning
Kepler.GL
Visualization
RAPIDS Open Source SOFTWARE
14. 14
RAPIDS: OPEN GPU DATA SCIENCE
cuDF, cuML, and cuGraph mimic well-known libraries
Data Preparation VisualizationModel Training
CUDA
PYTHON
DASK
DL
FRAMEWORKS
CUDNN
RAPIDS
CUMLCUDF CUGRAPH
APACHE ARROW
Pandas-like
ScikitLearn-like
NetworkX-like
15. 15
Dask
What is Dask and why does RAPIDS use it for scaling out?
• Dask is a distributed computation scheduler
built to scale Python workloads from laptops to
supercomputer clusters.
• Extremely modular with scheduling, compute,
data transfer, and out-of-core handling all being
disjointed allowing us to plug in our own
implementations.
• Can easily run multiple Dask workers per node
to allow for an easier development model of
one worker per GPU regardless of single node
or multi node environment.
16. 16
Dask
Scale up and out with cuDF
• Use cuDF primitives underneath in map-reduce style
operations with the same high level API
• Instead of using typical Dask data movement of
pickling objects and sending via TCP sockets, take
advantage of hardware advancements using a
communications framework called OpenUCX:
• For intranode data movement, utilize NVLink
and PCIe peer-to-peer communications
• For internode data movement, utilize GPU
RDMA over Infiniband and RoCE https://github.com/rapidsai/dask_gdf
17. 17
Faster speeds, real world benefits
2,290
1,956
1,999
1,948
169
157
0 500 1,000 1,500 2,000 2,500
20 CPU Nodes
30 CPU Nodes
50 CPU Nodes
100 CPU Nodes
DGX-2
5x DGX-1
0 2,000 4,000 6,000 8,000 10,000
20 CPU Nodes
30 CPU Nodes
50 CPU Nodes
100 CPU Nodes
DGX-2
5x DGX-1
cuML — XGBoost
2,741
1,675
715
379
42
19
0 1,000 2,000 3,000
20 CPU Nodes
30 CPU Nodes
50 CPU Nodes
100 CPU Nodes
DGX-2
5x DGX-1
End-to-End
cuIO/cuDF —
Load and Data Preparation
Benchmark
200GB CSV dataset; Data preparation
includes joins, variable transformations.
CPU Cluster Configuration
CPU nodes (61 GiB of memory, 8 vCPUs,
64-bit platform), Apache Spark
DGX Cluster Configuration
5x DGX-1 on InfiniBand network
Time in seconds — Shorter is better
cuIO / cuDF (Load and Data Preparation) Data Conversion XGBoost
8,762
6,148
3,925
3,221
322
213
18. 18
AI Libraries
Accelerating more of the AI ecosystem
Graph Analytics is fundamental to network analysis
Machine Learning is fundamental to prediction,
classification, clustering, anomaly detection and
recommendations.
Both can be accelerated with NVIDIA GPU
8x V100 20-90x faster than dual socket CPU
Decisions Trees
Random Forests
Linear Regressions
Logistics Regressions
K-Means
K-Nearest Neighbor
DBSCAN
Kalman Filtering
Principal Components
Single Value Decomposition
Bayesian Inferencing
PageRank
BFS
Jaccard Similarity
Single Source Shortest Path
Triangle Counting
Louvain Modularity
ARIMA
Holt-Winters
Machine Learning Graph Analytics
Time Series
XGBoost, Mortgage Dataset, 90x
3 Hours to 2 mins on 1 DGX-1
cuML & cuGraph
20. 20
RAPIDS: OPEN GPU DATA SCIENCE
Following the Pandas API for optimal code reuse
Data Preparation VisualizationModel Training
CUDA
PYTHON
DASK
DL
FRAMEWORKS
CUDNN
RAPIDS
CUMLCUDF CUGRAPH
APACHE ARROW
Pandas-like
import pandas as pd
import cudf
…
df = pd.read_csv('foo.csv', names=['index', 'A'], dtype=['date', 'float64'])
gdf = cudf.read_csv('foo.csv', names=['index', 'A'], dtype=['date', 'float64’])
…
# eventually majority of pandas features will be available in cudf
21. 21
cuDF
GPU DataFrame library
• Apache Arrow data format
• Pandas-like API
• Unary and Binary Operations
• Joins / Merges
• GroupBys
• Filters
• User-Defined Functions (UDFs)
• Accelerated file readers
• Much, much more
22. 22
CUDF
Today
CUDA With Python Bindings
• Low level library containing function
implementations and C/C++ API
• Importing/exporting Apache Arrow using the
CUDA IPC mechanism
• CUDA kernels to perform element-wise math
operations on GPU DataFrame columns
• CUDA sort, join, groupby, and reduction
operations on GPU DataFrames
• A Python library for manipulating GPU
DataFrames
• Python interface to CUDA C++ with additional
functionality
• Creating Apache Arrow from Numpy arrays,
Pandas DataFrames, and PyArrow Tables
• JIT compilation of User-Defined Functions
(UDFs) using Numba
23. 23
GPU-Accelerated string functions with a Pandas-like API
cuString & NVSTring
• API and functionality is following Pandas: https://pandas.pydata.org/pandas-
docs/stable/api.html#string-handling
• lower()
• ~22x speedup
• find()
• ~40x speedup
• slice()
• ~100x speedup
0.00
100.00
200.00
300.00
400.00
500.00
600.00
700.00
800.00
lower() find(#) slice(1,15)
milliseconds
Pandas cudastrings
24. 24
Next few months
GPU Dataframe
• Continue improving performance and functionality
• Single GPU
• Single node, multi GPU
• Multi node, multi GPU
• String Support
• Support for specific “string” dtype with GPU-accelerated functionality similar to Pandas
• Accelerated Data Loading
• File formats: CSV, Parquet, ORC – to start
27. 27
Early cugraph benchmarks
NVGRAPH methods (not currently in cuGraph)
100 billion TEPS (single GPU)
Can reach 300 billion TEPS on MG (4 GPUs) with hand-tuning
PageRank (SG) on graph with 5.5m edges and 51k nodes with RAPIDS
CPU (GraphFrames, Spark) [576 GB memory, 384 vcores]: 172 seconds
GPU (cuGraph, V100) [32 GB memory, 5120 CUDA cores]: 671 ms
Speed-up of 256x
For large graphs, there are huge parallelization benefits
32. 32
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
33. 33
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
34. 34
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Massive Dataset
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
35. 35
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Massive Dataset
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
36. 36
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Massive Dataset
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
Iterate.
37. 37
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Massive Dataset
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
Iterate. Cross Validate.
38. 38
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Massive Dataset
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
Iterate. Cross Validate & Grid Search.
39. 39
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Massive Dataset
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
Iterate. Cross Validate & Grid Search.
Iterate some more.
40. 40
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Massive Dataset
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
Iterate. Cross Validate & Grid Search.
Iterate some more.
Meet reasonable speed vs accuracy tradeoff
41. 41
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Massive Dataset
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
Iterate. Cross Validate & Grid Search.
Iterate some more.
Meet reasonable speed vs accuracy tradeoff
Time
Increases
42. 42
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Massive Dataset
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
Iterate. Cross Validate & Grid Search.
Iterate some more.
Meet reasonable speed vs accuracy tradeoff
Hours?
Time
Increases
43. 43
Problem
Data sizes continue to grow
Histograms / Distributions
Dimension Reduction
Feature Selection
Remove Outliers
Sampling
Massive Dataset
Better to start with as much data as
possible and explore / preprocess to scale
to performance needs.
Iterate. Cross Validate & Grid Search.
Iterate some more.
Meet reasonable speed vs accuracy tradeoff
Hours? Days?
Time
Increases
45. 45
cuML API
Python
Algorithms
Primitives
GPU-accelerated machine learning at every layer
Scikit-learn-like interface for data scientists utilizing
cuDF & Numpy
C++ API for developers to utilize accelerated machine
learning algorithms.
Reusable building blocks for composing machine learning
algorithms.
46. 46
Linear Algebra
Primitives
GPU-accelerated math optimized for feature matrices
Statistics
Sparse
Random
Distance / Metrics
Optimizers
More to come!
• Element-wise operations
• Matrix multiply
• Norms
• Eigen Decomposition
• SVD/RSVD
• Transpose
• QR Decomposition
47. 47
Algorithms
GPU-accelerated Scikit-Learn
Classification / Regression
Statistical Inference
Clustering
Decomposition & Dimensionality Reduction
Timeseries Forecasting
Recommendations
Decision Trees / Random Forests
Linear Regression
Logistic Regression
K-Nearest Neighbors
Kalman Filtering
Bayesian Inference
Gaussian Mixture Models
Hidden Markov Models
K-Means
DBSCAN
Spectral Clustering
Principal Components
Singular Value Decomposition
UMAP
Spectral Embedding
ARIMA
Holt-Winters
Implicit Matrix Factorization
Cross Validation
More to come!
Hyper-parameter Tuning
48. 48
HIGH-LEVEL APIs
CUDA/C++
Multi-Node & Multi-GPU Communications
ML Primitives
ML Algorithms
Python
Dask Multi-GPU ML
Scikit-Learn-Like
Host 2
GPU1 GPU3
GPU2 GPU4
Host 1
GPU1 GPU3
GPU2 GPU4
49. 49
HIGH-LEVEL APIs
CUDA/C++
Multi-Node / Multi-GPU Communications
ML Primitives
ML Algorithms
Python
Dask Multi-GPU ML
Scikit-Learn-Like
Host 2
GPU1 GPU3
GPU2 GPU4
Host 1
GPU1 GPU3
GPU2 GPU4
Data Parallelism
Model Parallelism
57. 57
Dimensionality Reduction
pca = PCA(n_components = 2)
pca.fit(df)
X = pca.transform(df)
Code Example
uniq = set(labels.values)
colors = [plt.cm.Spectral(each)
for each in np.linspace(0, 1, len(uniq))]
for k, col in zip(uniq, colors):
c = X[l==k].as_gpu_matrix()
plt.plot(c[:,0], c[:,1], '.',
markerfacecolor=tuple(col))
Reduce Dimensions
Chart Rotated Data
58. 58
Dimensionality Reduction
pca = PCA(n_components = 2)
pca.fit(df)
X = pca.transform(df)
Code Example
uniq = set(labels.values)
colors = [plt.cm.Spectral(each)
for each in np.linspace(0, 1, len(uniq))]
for k, col in zip(uniq, colors):
c = X[l==k].as_gpu_matrix()
plt.plot(c[:,0], c[:,1], '.',
markerfacecolor=tuple(col))
Reduce Dimensions
Chart Rotated Data
59. 59
Dimensionality Reduction
pca = PCA(n_components = 2)
pca.fit(df)
X = pca.transform(df)
Code Example
uniq = set(labels.values)
colors = [plt.cm.Spectral(each)
for each in np.linspace(0, 1, len(uniq))]
for k, col in zip(uniq, colors):
c = X[l==k].as_gpu_matrix()
plt.plot(c[:,0], c[:,1], '.',
markerfacecolor=tuple(col))
Reduce Dimensions
Chart Rotated Data
60. 60
Clustering
Code Example
X, y = make_moons(n_samples=int(1e2),
noise=0.05, random_state=0)
X_df = pd.DataFrame({'fea%d'%i: X[:, i]
for i in range(X.shape[1])})
X = cudf.DataFrame.from_pandas(X_df)
Load Moons Dataset and Preprocess
61. 61
Clustering
Code Example
X, y = make_moons(n_samples=int(1e2),
noise=0.05, random_state=0)
X_df = pd.DataFrame({'fea%d'%i: X[:, i]
for i in range(X.shape[1])})
X = cudf.DataFrame.from_pandas(X_df)
Load Moons Dataset and Preprocess
62. 62
Clustering
Code Example
X, y = make_moons(n_samples=int(1e2),
noise=0.05, random_state=0)
X_df = pd.DataFrame({'fea%d'%i: X[:, i]
for i in range(X.shape[1])})
X = cudf.DataFrame.from_pandas(X_df)
Load Moons Dataset and Preprocess
63. 63
Clustering
from cuml import DBSCAN
dbscan = DBSCAN(eps = 0.3, min_samples = 5)
dbscan.fit(X)
y_hat = db.fit_predict(X)
Code Example
Find Clusters
X, y = make_moons(n_samples=int(1e2),
noise=0.05, random_state=0)
X_df = pd.DataFrame({'fea%d'%i: X[:, i]
for i in range(X.shape[1])})
X = cudf.DataFrame.from_pandas(X_df)
Load Moons Dataset and Preprocess
64. 64
Clustering
from cuml import DBSCAN
dbscan = DBSCAN(eps = 0.3, min_samples = 5)
dbscan.fit(X)
y_hat = db.fit_predict(X)
Code Example
Find Clusters
X, y = make_moons(n_samples=int(1e2),
noise=0.05, random_state=0)
X_df = pd.DataFrame({'fea%d'%i: X[:, i]
for i in range(X.shape[1])})
X = cudf.DataFrame.from_pandas(X_df)
Load Moons Dataset and Preprocess
66. 66
cuDF + XGBoost
DGX-2 vs Scale Out CPU Cluster
• Full end to end pipeline
• Leveraging Dask + cuDF
• Store each GPU results in sys mem then read back in
• Arrow to Dmatrix (CSR) for XGBoost
67. 67
cuDF + XGBoost
Scale Out GPU Cluster vs DGX-2
0 50 100 150 200 250 300 350
5xDGX-1
DGX-2
Chart Title
ETL+CSV (s) ML Prep (s) ML (s)
• Full end to end pipeline
• Leveraging Dask for multi-node + cuDF
• Store each GPU results in sys mem then read back in
• Arrow to Dmatrix (CSR) for XGBoost
68. 68
cuDF + XGBoost
Fully In- GPU Benchmarks
• Full end to end pipeline
• Leveraging Dask cuDF
• No Data Prep time all in memory
• Arrow to Dmatrix (CSR) for XGBoost