Все мы знаем, что наш любимый Pandas исключительно однопоточный, а модели из scikit-learn часто учатся не очень быстро даже в несколько процессов. Поэтому в докладе я расскажу о проекте RAPIDS - наборе библиотек для анализа данных и построения предиктивных моделей с использованием NVIDIA GPU. В докладе я предложу подискутировать о том, что закон Мура больше не выполняется, рассмотрю принципы работы архитектуры CUDA. Разберу библиотеки cuDF и cuML, а также постараюсь предельно честно рассказать о том, ждать ли чуда от перехода на GPU и в каких случаях чудо неизбежно.
2024: Domino Containers - The Next Step. News from the Domino Container commu...
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
1. Pavel Klemenkov, Chief Data Scientist @ NVIDIA
RAPIDS: SPEEDING UP
PANDAS AND SCIKIT-LEARN
2. 2
TYPICAL DS PIPELINE
All
Data
ETL
Manage Data
Structured
Data Store
Data
Preparation
Training
Model
Training
Visualization
Evaluate
Inference
Deploy
Can we test more
hypothesis per unit of
time?
3. 3
TYPICAL DS PIPELINE
All
Data
ETL
Manage Data
Structured
Data Store
Data
Preparation
Training
Model
Training
Visualization
Evaluate
Inference
Deploy
Can we test more
hypothesis per unit of
time?
Hyperparameters
optimization
4. 4
RAPIDS — OPEN GPU DATA SCIENCE
Software Stack Python
CUDA
PYTHON
APACHE ARROW on GPU Memory
DASK
DEEP LEARNING
FRAMEWORKS
CUDNN
RAPIDS
CUMLCUDF CUGRAPH
7. 7
def randChar(f, numGrp, N):
things = [f.format(x) for x in range(numGrp)]
return [things[x] for x in np.random.choice(numGrp, N)]
def randFloat(numGrp, N) :
things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
return [things[x] for x in np.random.choice(numGrp, N)]
N = int(1e7)
K = 100
pdf = pd.DataFrame({
'id1' : randChar("id{0:0=3d}", K, N), # large groups (char)
'id2' : randChar("id{0:0=3d}", K, N), # large groups (char)
'id3' : randChar("id{0:0=3d}", N//K, N), # small groups (char)
'id4' : np.random.choice(K, N), # large groups (int)
'id5' : np.random.choice(K, N), # large groups (int)
'id6' : np.random.choice(N//K, N), # small groups (int)
'v1' : np.random.choice(5, N), # int in range [1,5]
'v2' : np.random.choice(5, N), # int in range [1,5]
'v3' : randFloat(100,N) # numeric e.g. 23.5749
})
cdf = cudf.DataFrame.from_pandas(pdf)
8. 8
BENCHMARK #1
%%timeit -r 3 -n 3
pdf.groupby(['id1']).agg({'v1':'sum’})
776 ms ± 4.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
cdf.groupby(['id1']).agg({'v1':'sum’})
21.5 ms ± 1.3 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
Small number of large groups
9. 9
BENCHMARK #2
%%timeit -r 3 -n 3
pdf.groupby(['id1','id2']).agg({'v1':'sum’})
1.79 s ± 10.7 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
cdf.groupby(['id1','id2']).agg({'v1':'sum’})
37.5 ms ± 14.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
Multiple groups
10. 10
BENCHMARK #3
%%timeit -r 3 -n 3
pdf.groupby(['id3']).agg({'v1':'sum', 'v3':'mean’})
1.36 s ± 21.9 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
cdf.groupby(['id3']).agg({'v1':'sum', 'v3':'mean’})
53 ms ± 2.42 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
Large number (1e5) of small groups, multiple arrgegates
GroupBy benchmark notebook
11. 11
WAIT A MINUTE…
• Pandas is single-threaded, but there is Dask
• cuDF is a single GPU solution
12. 12
WAIT A MINUTE…
ddf = dask.dataframe.from_pandas(pdf, npartitions=8)
%%timeit -r 3 -n 3
pdf.groupby(['id1','id2']).agg({'v1':'sum’})
1.79 s ± 10.7 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
ddf.groupby(["id1", "id2"]).agg({'v1': 'sum'}).compute()
1.34 s ± 33.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
cdf.groupby(['id1','id2']).agg({'v1':'sum’})
37.5 ms ± 14.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
DASK DataFrame execution
15. 15
Category Algorithm Notes
Clustering
Density-Based Spatial Clustering
of Applications with Noise
(DBSCAN)
K-Means Multi-node multi-GPU via Dask
Dimensionality Reduction
Principal Components Analysis
(PCA)
Multi-node multi-GPU via Dask
Truncated Singular Value
Decomposition (tSVD)
Multi-node multi-GPU via Dask
Uniform Manifold Approximation
and Projection (UMAP)
Random Projection
t-Distributed Stochastic
Neighbor Embedding (TSNE)
Linear Models for Regression
or Classification
Linear Regression (OLS)
Linear Regression with Lasso or
Ridge Regularization
ElasticNet Regression
Logistic Regression
Stochastic Gradient Descent
(SGD), Coordinate Descent (CD),
and Quasi-Newton (QN)
(including L-BFGS and OWL-QN)
solvers for linear models
16. 16
Category Algorithm Notes
Nonlinear Models for
Regression or Classification
Random Forest (RF)
Classification
Experimental multi-node multi-
GPU via Dask
Random Forest (RF) Regression
Experimental multi-node multi-
GPU via Dask
Inference for decision tree-
based models
Forest Inference Library (FIL)
K-Nearest Neighbors (KNN)
Multi-node multi-GPU via Dask,
uses Faiss for Nearest Neighbors
Query.
K-Nearest Neighbors (KNN)
Classification
K-Nearest Neighbors (KNN)
Regression
Support Vector Machine
Classifier (SVC)
Epsilon-Support Vector
Regression (SVR)
Time Series Linear Kalman Filter
Holt-Winters Exponential
Smoothing
Auto-regressive Integrated
Moving Average (ARIMA)
18. 18
START DASK CLUSTER
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
# This will use all GPUs on the local host by default
cluster = LocalCUDACluster(threads_per_worker=1)
c = Client(cluster)
# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8 # Performance optimization
19. 19
GENERATE DATA
# Data parameters
train_size = int(1e6)
test_size = int(1e3)
n_samples = train_size + test_size
n_features = 20
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
n_clusters_per_class=1, n_informative=int(n_features / 3),
random_state=123, n_classes=5)
y = y.astype(np.int32)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size)
20. 20
DISTRIBUTE DATA TO GPUS
n_partitions = n_workers
# First convert to cudf (with real data, you would likely load in cuDF format to start)
X_train_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_train))
y_train_cudf = cudf.Series(y_train)
# Partition with Dask
# In this case, each worker will train on 1/n_partitions fraction of the data
X_train_dask = dask_cudf.from_cudf(X_train_cudf, npartitions=n_partitions)
y_train_dask = dask_cudf.from_cudf(y_train_cudf, npartitions=n_partitions)
# Persist to cache the data in active memory
X_train_dask, y_train_dask =
dask_utils.persist_across_workers(c, [X_train_dask, y_train_dask], workers=workers)
24. 24
BUILD DISTRIBUTED CUML MODEL
# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000
%%time
cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins,
n_streams=n_streams)
cuml_model.fit(X_train_dask, y_train_dask)
wait(cuml_model.rfs) # Allow asynchronous training tasks to finish
CPU times: user 133 ms, sys: 24.4 ms, total: 157 ms
Wall time: 1.93 s
25. 25
PREDICT AND CHECK ACCURACY
skl_y_pred = skl_model.predict(X_test)
cuml_y_pred = cuml_model.predict(X_test)
# Due to randomness in the algorithm, you may see slight variation in accuracies
print("SKLearn accuracy: ", accuracy_score(y_test, skl_y_pred))
print("CuML accuracy: ", accuracy_score(y_test, cuml_y_pred))
SKLearn accuracy: 0.899
CuML accuracy: 0.886
Random Forest SNMG demo
27. 27
YES!
• Still pretty amature and not ready for production
• Especially DASK
• Porting UDFs is hard [1, 2]
• No CPU version (even for inference)
• No automatic memory management
• Due to obvious reasons
1. Apply Operations in cuDF
2. Numba cuDF integration
32. 32
GPU TAKE AWAYS
1. GPU memory bus is ~7x wider than CPU
2. GPU has thousands of “simple” ALUs
3. GPU is a peripherial device
1. CPU needs to run a CUDA kernel on GPU
2. GPU connects to CPU via PCI Express
33. 33
DRAM CPU GPU DRAM
GPU
(Tesla V100)
DDR4 4ch
60 GB/s
PCI v4 x16
32 GB/s
HBM2
900 GB/s
CPU TO GPU IS SLOW!
30x performance drop
34. 34
GPU BEST PRACTICE
1. Data must not leave GPU memory!
2. You will get performance boost if your dataset is big enough to keep GPU busy
3. Use Apache Arrow compatible formats (e.g. Parquet)
4. Keep an eye on GPUDirect Storage and similar
5. CUDA is different to what you’re used to. Accept it and make use of it!
35. 35
USEFUL LINKS
RAPIDS
RAPIDS DOCS
rapids-nightly dockerhub (use it except for production)
RAPIDS Notebooks
RAPIDS Contributed Notebooks
kNN 600x speedup on MNIST (Kaggle notebook)
Multi-GPU XGBoost with RAPIDS
Dmitry Ursegov presentation for Moscow Spark #7
Numba for CUDA GPUs
PyCUDA