Pavel Klemenkov, Chief Data Scientist @ NVIDIA
RAPIDS: SPEEDING UP
PANDAS AND SCIKIT-LEARN
2
TYPICAL DS PIPELINE
Pipeline diagram (labels): All Data, ETL, Manage Data, Structured Data Store, Data Preparation, Training, Model Training, Visualization, Evaluate, Inference, Deploy
Can we test more hypotheses per unit of time?
Hyperparameters
optimization
4
RAPIDS — OPEN GPU DATA SCIENCE
Software Stack Python
CUDA
PYTHON
APACHE ARROW on GPU Memory
DASK
DEEP LEARNING
FRAMEWORKS
CUDNN
RAPIDS
CUDF CUML CUGRAPH
5
GETTING STARTED
rapids.ai getting started
10 minutes to cuDF
6
“GROUP BY” BENCHMARK
7
import numpy as np
import pandas as pd
import cudf

def randChar(f, numGrp, N):
    # Build numGrp distinct labels, then sample N of them with replacement
    things = [f.format(x) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

def randFloat(numGrp, N):
    # Build numGrp random floats in [0, 100), then sample N of them with replacement
    things = [round(100 * np.random.random(), 4) for x in range(numGrp)]
    return [things[x] for x in np.random.choice(numGrp, N)]

N = int(1e7)  # number of rows
K = 100       # number of large groups
pdf = pd.DataFrame({
    'id1' : randChar("id{0:0=3d}", K, N),    # large groups (char)
    'id2' : randChar("id{0:0=3d}", K, N),    # large groups (char)
    'id3' : randChar("id{0:0=3d}", N//K, N), # small groups (char)
    'id4' : np.random.choice(K, N),          # large groups (int)
    'id5' : np.random.choice(K, N),          # large groups (int)
    'id6' : np.random.choice(N//K, N),       # small groups (int)
    'v1' : np.random.choice(5, N),           # int in range [0,4]
    'v2' : np.random.choice(5, N),           # int in range [0,4]
    'v3' : randFloat(100,N)                  # numeric e.g. 23.5749
})
# Copy the pandas DataFrame into GPU memory
cdf = cudf.DataFrame.from_pandas(pdf)
8
BENCHMARK #1
%%timeit -r 3 -n 3
pdf.groupby(['id1']).agg({'v1':'sum'})
776 ms ± 4.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
cdf.groupby(['id1']).agg({'v1':'sum'})
21.5 ms ± 1.3 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
Small number of large groups
9
BENCHMARK #2
%%timeit -r 3 -n 3
pdf.groupby(['id1','id2']).agg({'v1':'sum'})
1.79 s ± 10.7 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
cdf.groupby(['id1','id2']).agg({'v1':'sum'})
37.5 ms ± 14.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
Multiple groups
10
BENCHMARK #3
%%timeit -r 3 -n 3
pdf.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})
1.36 s ± 21.9 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
cdf.groupby(['id3']).agg({'v1':'sum', 'v3':'mean'})
53 ms ± 2.42 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
Large number (1e5) of small groups, multiple aggregates
GroupBy benchmark notebook
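The speedups implied by the three benchmarks can be worked out directly from the mean timings quoted on the slides. The snippet below is only an arithmetic sketch: the dictionary keys are shorthand labels chosen here, and the numbers are the pandas and cuDF means copied from the slides.

```python
# Mean timings in milliseconds (pandas, cuDF), copied from the benchmark slides.
timings_ms = {
    "#1 groupby id1, sum(v1)": (776.0, 21.5),
    "#2 groupby id1+id2, sum(v1)": (1790.0, 37.5),
    "#3 groupby id3, sum(v1)+mean(v3)": (1360.0, 53.0),
}


def speedup(cpu_ms, gpu_ms):
    """Return how many times faster the GPU run is than the CPU run."""
    return cpu_ms / gpu_ms


speedups = {name: speedup(cpu, gpu) for name, (cpu, gpu) in timings_ms.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

With only 3 runs of 3 loops each, and a standard deviation of 14.8 ms on the 37.5 ms cuDF measurement in benchmark #2, these ratios are best read as rough orders of magnitude.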
11
WAIT A MINUTE…
• Pandas is single-threaded, but there is Dask
• cuDF is a single GPU solution
12
WAIT A MINUTE…
ddf = dask.dataframe.from_pandas(pdf, npartitions=8)
%%timeit -r 3 -n 3
pdf.groupby(['id1','id2']).agg({'v1':'sum'})
1.79 s ± 10.7 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
ddf.groupby(["id1", "id2"]).agg({'v1': 'sum'}).compute()
1.34 s ± 33.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
%%timeit -r 3 -n 3
cdf.groupby(['id1','id2']).agg({'v1':'sum'})
37.5 ms ± 14.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
DASK DataFrame execution
13
+
14
CUML
15
Category | Algorithm | Notes
Clustering | Density-Based Spatial Clustering of Applications with Noise (DBSCAN) |
Clustering | K-Means | Multi-node multi-GPU via Dask
Dimensionality Reduction | Principal Components Analysis (PCA) | Multi-node multi-GPU via Dask
Dimensionality Reduction | Truncated Singular Value Decomposition (tSVD) | Multi-node multi-GPU via Dask
Dimensionality Reduction | Uniform Manifold Approximation and Projection (UMAP) |
Dimensionality Reduction | Random Projection |
Dimensionality Reduction | t-Distributed Stochastic Neighbor Embedding (TSNE) |
Linear Models for Regression or Classification | Linear Regression (OLS) |
Linear Models for Regression or Classification | Linear Regression with Lasso or Ridge Regularization |
Linear Models for Regression or Classification | ElasticNet Regression |
Linear Models for Regression or Classification | Logistic Regression |
Linear Models for Regression or Classification | Stochastic Gradient Descent (SGD), Coordinate Descent (CD), and Quasi-Newton (QN) (including L-BFGS and OWL-QN) solvers for linear models |
16
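Category | Algorithm | Notes
Nonlinear Models for Regression or Classification | Random Forest (RF) Classification | Experimental multi-node multi-GPU via Dask
Nonlinear Models for Regression or Classification | Random Forest (RF) Regression | Experimental multi-node multi-GPU via Dask
Nonlinear Models for Regression or Classification | Inference for decision tree-based models | Forest Inference Library (FIL)
Nonlinear Models for Regression or Classification | K-Nearest Neighbors (KNN) | Multi-node multi-GPU via Dask, uses Faiss for Nearest Neighbors Query.
Nonlinear Models for Regression or Classification | K-Nearest Neighbors (KNN) Classification |
Nonlinear Models for Regression or Classification | K-Nearest Neighbors (KNN) Regression |
Nonlinear Models for Regression or Classification | Support Vector Machine Classifier (SVC) |
Nonlinear Models for Regression or Classification | Epsilon-Support Vector Regression (SVR) |
Time Series | Linear Kalman Filter |
Time Series | Holt-Winters Exponential Smoothing |
Time Series | Auto-regressive Integrated Moving Average (ARIMA) |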
17
RANDOM FOREST SNMG
18
START DASK CLUSTER
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
# This will use all GPUs on the local host by default
cluster = LocalCUDACluster(threads_per_worker=1)
c = Client(cluster)
# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8 # Performance optimization
19
GENERATE DATA
# Data parameters
train_size = int(1e6)
test_size = int(1e3)
n_samples = train_size + test_size
n_features = 20
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                    n_clusters_per_class=1, n_informative=int(n_features / 3),
                                    random_state=123, n_classes=5)
y = y.astype(np.int32)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size)
20
DISTRIBUTE DATA TO GPUS
n_partitions = n_workers
# First convert to cudf (with real data, you would likely load in cuDF format to start)
X_train_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_train))
y_train_cudf = cudf.Series(y_train)
# Partition with Dask
# In this case, each worker will train on 1/n_partitions fraction of the data
X_train_dask = dask_cudf.from_cudf(X_train_cudf, npartitions=n_partitions)
y_train_dask = dask_cudf.from_cudf(y_train_cudf, npartitions=n_partitions)
# Persist to cache the data in active memory
X_train_dask, y_train_dask = dask_utils.persist_across_workers(
    c, [X_train_dask, y_train_dask], workers=workers)
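To make the "1/n_partitions fraction of the data" comment concrete, here is a small pure-Python sketch of how the training rows would be divided. The worker count of 4 is a hypothetical value (the slides do not say how many GPUs were used), and `split_evenly` is an illustrative helper, not a Dask or cuML function. Applying the same split to the trees assumes the forest is divided evenly across workers, which is an assumption of this sketch rather than something the slides state.

```python
def split_evenly(total, parts):
    """Split `total` items into `parts` chunks whose sizes differ by at most 1."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


train_size = int(1e6)  # from the GENERATE DATA slide
n_trees = 1000         # from the model-building slides
n_workers = 4          # hypothetical: one Dask worker per GPU on a 4-GPU host

rows_per_worker = split_evenly(train_size, n_workers)
trees_per_worker = split_evenly(n_trees, n_workers)  # assumes an even split of the forest

print("rows per worker: ", rows_per_worker)
print("trees per worker:", trees_per_worker)
```

Under these assumptions every worker sees only a quarter of the rows, which is one plausible reason for a small accuracy difference relative to a model trained on the full dataset.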
21
22
BUILD A SCIKIT-LEARN MODEL
# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000
%%time
# Use all available CPU cores
skl_model = sklRF(max_depth=max_depth, n_estimators=n_trees, n_jobs=-1)
skl_model.fit(X_train, y_train)
CPU times: user 3h 3min 18s, sys: 32.3 s, total: 3h 3min 51s
Wall time: 2min 27s
23
24
BUILD DISTRIBUTED CUML MODEL
# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000
%%time
cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins,
                        n_streams=n_streams)
cuml_model.fit(X_train_dask, y_train_dask)
wait(cuml_model.rfs) # Allow asynchronous training tasks to finish
CPU times: user 133 ms, sys: 24.4 ms, total: 157 ms
Wall time: 1.93 s
25
PREDICT AND CHECK ACCURACY
skl_y_pred = skl_model.predict(X_test)
cuml_y_pred = cuml_model.predict(X_test)
# Due to randomness in the algorithm, you may see slight variation in accuracies
print("SKLearn accuracy: ", accuracy_score(y_test, skl_y_pred))
print("CuML accuracy: ", accuracy_score(y_test, cuml_y_pred))
SKLearn accuracy: 0.899
CuML accuracy: 0.886
Random Forest SNMG demo
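The timing and accuracy figures from the two random forest runs can be compared with a few lines of arithmetic. This is a sketch that only reuses the numbers printed on the slides; `to_seconds` is a helper introduced here for readability.

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an h/min/s reading from %%time output into seconds."""
    return hours * 3600 + minutes * 60 + seconds


# scikit-learn run: "total: 3h 3min 51s", "Wall time: 2min 27s"
skl_cpu_s = to_seconds(hours=3, minutes=3, seconds=51)
skl_wall_s = to_seconds(minutes=2, seconds=27)

# cuML + Dask run: "Wall time: 1.93 s"
cuml_wall_s = 1.93

# Wall-clock speedup of the GPU run over the CPU run
wall_speedup = skl_wall_s / cuml_wall_s

# CPU time divided by wall time gives the average number of busy CPU cores
busy_cores = skl_cpu_s / skl_wall_s

# Accuracy difference between the two models on the test set
accuracy_gap = 0.899 - 0.886

print(f"wall-clock speedup: {wall_speedup:.0f}x")
print(f"average busy CPU cores in the scikit-learn run: {busy_cores:.0f}")
print(f"accuracy gap: {accuracy_gap:.3f}")
```

The ratio of CPU time to wall time suggests the scikit-learn baseline kept roughly 75 cores busy on average, so the comparison is against a large multi-core host rather than a single core.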
26
ANY PROBLEMS?
27
YES!
• Still pretty immature and not ready for production
• Especially DASK
• Porting UDFs is hard [1, 2]
• No CPU version (even for inference)
• No automatic memory management
• Due to obvious reasons
1. Apply Operations in cuDF
2. Numba cuDF integration
28
GPU 101
29
Component | 2010 | 2016 | 2019 | Scale factor
Storage | 50 MB/s (HDD) | 500 MB/s (SATA-SSD) | 2 GB/s (NVMe-SSD) | 40x
Network | 1 Gbit/s | 10 Gbit/s | 40 Gbit/s | 40x
CPU | 500 GFLOPS | 1 200 GFLOPS | 3 000 GFLOPS (18 cores/avx512) | 6x
CPU mem | 40 GB/s | 80 GB/s | 125 GB/s | 3x
GPU | 1 300 GFLOPS | 6 000 GFLOPS | 15 000 GFLOPS | 12x
GPU mem | 150 GB/s | 480 GB/s | 900 GB/s | 6x
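The "Scale factor" column is the 2019 value divided by the 2010 value. The sketch below recomputes it from the table's own figures, assuming 2 GB/s is read as 2000 MB/s; the dictionary layout is just a convenience chosen here.

```python
# (2010 value, 2019 value) in the unit named in the key, taken from the table.
hardware = {
    "Storage (MB/s)": (50, 2000),   # 2 GB/s read as 2000 MB/s
    "Network (Gbit/s)": (1, 40),
    "CPU (GFLOPS)": (500, 3000),
    "CPU mem (GB/s)": (40, 125),
    "GPU (GFLOPS)": (1300, 15000),
    "GPU mem (GB/s)": (150, 900),
}

scale = {name: new / old for name, (old, new) in hardware.items()}

for name, factor in scale.items():
    print(f"{name}: {factor:.1f}x")
```

The exact ratios for CPU memory (about 3.1x) and GPU compute (about 11.5x) round to the 3x and 12x shown in the table; storage and network grew much faster than CPU memory bandwidth over the same period.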
30
Device | Performance, GFLOPS | Memory bandwidth, GB/s | TDP, W | Price, $
Nvidia Tesla T4 | 8 100 | 320 | 75 | 3000
Intel® Xeon® Gold 6140 | 2 500 | 120 | 140 | 3000
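Since both devices are listed at the same price, the comparison is easiest to read as performance per watt and per dollar. The following sketch derives those ratios from the table; the field names are invented here for illustration.

```python
# Figures copied from the comparison table; field names are illustrative.
devices = {
    "Nvidia Tesla T4": {"gflops": 8100, "mem_bw_gb_s": 320, "tdp_w": 75, "price_usd": 3000},
    "Intel Xeon Gold 6140": {"gflops": 2500, "mem_bw_gb_s": 120, "tdp_w": 140, "price_usd": 3000},
}

efficiency = {
    name: {
        "gflops_per_watt": spec["gflops"] / spec["tdp_w"],
        "gflops_per_dollar": spec["gflops"] / spec["price_usd"],
    }
    for name, spec in devices.items()
}

for name, eff in efficiency.items():
    print(f"{name}: {eff['gflops_per_watt']:.1f} GFLOPS/W, "
          f"{eff['gflops_per_dollar']:.2f} GFLOPS/$")

# How many times more GFLOPS per watt the T4 delivers than the Xeon
watt_advantage = (efficiency["Nvidia Tesla T4"]["gflops_per_watt"]
                  / efficiency["Intel Xeon Gold 6140"]["gflops_per_watt"])
print(f"T4 advantage per watt: {watt_advantage:.1f}x")
```

These are peak figures from the slide, so the ratios are upper bounds on what a real workload would see.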
31
GPU VS CPU ARCHITECTURE
32
GPU TAKE AWAYS
1. GPU memory bus is ~7x wider than CPU
2. GPU has thousands of “simple” ALUs
3. GPU is a peripheral device
   1. CPU needs to launch a CUDA kernel on GPU
   2. GPU connects to CPU via PCI Express
33
DRAM <-> CPU: DDR4 4ch, 60 GB/s
CPU <-> GPU (Tesla V100): PCIe v4 x16, 32 GB/s
GPU <-> GPU DRAM: HBM2, 900 GB/s
CPU TO GPU IS SLOW!
30x performance drop
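The "30x" figure follows from the ratio of GPU memory bandwidth to PCI Express bandwidth. The sketch below recomputes it and estimates how long one pass over a dataset takes across each link; the 16 GB dataset size is a hypothetical value picked for illustration, and the link labels are shorthand chosen here.

```python
# Peak bandwidths in GB/s, as quoted on the slide.
links_gb_per_s = {
    "DRAM <-> CPU (DDR4 4ch)": 60,
    "CPU <-> GPU (PCIe x16)": 32,
    "GPU <-> GPU DRAM (HBM2)": 900,
}

dataset_gb = 16  # hypothetical dataset size


def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Ideal time to move `size_gb` over a link at peak bandwidth."""
    return size_gb / bandwidth_gb_per_s


# Bandwidth drop when data has to cross PCIe instead of staying in GPU memory
pcie_penalty = (links_gb_per_s["GPU <-> GPU DRAM (HBM2)"]
                / links_gb_per_s["CPU <-> GPU (PCIe x16)"])
print(f"HBM2 vs PCIe: {pcie_penalty:.1f}x")

for name, bandwidth in links_gb_per_s.items():
    ms = transfer_seconds(dataset_gb, bandwidth) * 1000
    print(f"{name}: {ms:.0f} ms for {dataset_gb} GB")
```

At peak rates, one trip over PCIe costs about as much as 28 full passes over the same data in GPU memory, which is why round trips between host and device should be avoided.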
34
GPU BEST PRACTICE
1. Data must not leave GPU memory!
2. You will get a performance boost if your dataset is big enough to keep the GPU busy
3. Use Apache Arrow compatible formats (e.g. Parquet)
4. Keep an eye on GPUDirect Storage and similar technologies
5. CUDA is different from what you’re used to. Accept it and make use of it!
35
USEFUL LINKS
RAPIDS
RAPIDS DOCS
rapids-nightly dockerhub (use it except for production)
RAPIDS Notebooks
RAPIDS Contributed Notebooks
kNN 600x speedup on MNIST (Kaggle notebook)
Multi-GPU XGBoost with RAPIDS
Dmitry Ursegov presentation for Moscow Spark #7
Numba for CUDA GPUs
PyCUDA
  • 1. Pavel Klemenkov, Chief Data Scientist @ NVIDIA RAPIDS: SPEEDING UP PANDAS AND SCIKIT-LEARN
  • 2. 2 TYPICAL DS PIPELINE All Data ETL Manage Data Structured Data Store Data Preparation Training Model Training Visualization Evaluate Inference Deploy Can we test more hypothesis per unit of time?
  • 3. 3 TYPICAL DS PIPELINE All Data ETL Manage Data Structured Data Store Data Preparation Training Model Training Visualization Evaluate Inference Deploy Can we test more hypothesis per unit of time? Hyperparameters optimization
  • 4. 4 RAPIDS — OPEN GPU DATA SCIENCE Software Stack Python CUDA PYTHON APACHE ARROW on GPU Memory DASK DEEP LEARNING FRAMEWORKS CUDNN RAPIDS CUMLCUDF CUGRAPH
  • 5. 5 GETTING STARTED rapids.ai getting started 10 minutes to cuDF
  • 7. 7 def randChar(f, numGrp, N): things = [f.format(x) for x in range(numGrp)] return [things[x] for x in np.random.choice(numGrp, N)] def randFloat(numGrp, N) : things = [round(100 * np.random.random(), 4) for x in range(numGrp)] return [things[x] for x in np.random.choice(numGrp, N)] N = int(1e7) K = 100 pdf = pd.DataFrame({ 'id1' : randChar("id{0:0=3d}", K, N), # large groups (char) 'id2' : randChar("id{0:0=3d}", K, N), # large groups (char) 'id3' : randChar("id{0:0=3d}", N//K, N), # small groups (char) 'id4' : np.random.choice(K, N), # large groups (int) 'id5' : np.random.choice(K, N), # large groups (int) 'id6' : np.random.choice(N//K, N), # small groups (int) 'v1' : np.random.choice(5, N), # int in range [1,5] 'v2' : np.random.choice(5, N), # int in range [1,5] 'v3' : randFloat(100,N) # numeric e.g. 23.5749 }) cdf = cudf.DataFrame.from_pandas(pdf)
  • 8. 8 BENCHMARK #1 %%timeit -r 3 -n 3 pdf.groupby(['id1']).agg({'v1':'sum’}) 776 ms ± 4.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each) %%timeit -r 3 -n 3 cdf.groupby(['id1']).agg({'v1':'sum’}) 21.5 ms ± 1.3 ms per loop (mean ± std. dev. of 3 runs, 3 loops each) Small number of large groups
  • 9. 9 BENCHMARK #2 %%timeit -r 3 -n 3 pdf.groupby(['id1','id2']).agg({'v1':'sum’}) 1.79 s ± 10.7 ms per loop (mean ± std. dev. of 3 runs, 3 loops each) %%timeit -r 3 -n 3 cdf.groupby(['id1','id2']).agg({'v1':'sum’}) 37.5 ms ± 14.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each) Multiple groups
  • 10. 10 BENCHMARK #3 %%timeit -r 3 -n 3 pdf.groupby(['id3']).agg({'v1':'sum', 'v3':'mean’}) 1.36 s ± 21.9 ms per loop (mean ± std. dev. of 3 runs, 3 loops each) %%timeit -r 3 -n 3 cdf.groupby(['id3']).agg({'v1':'sum', 'v3':'mean’}) 53 ms ± 2.42 ms per loop (mean ± std. dev. of 3 runs, 3 loops each) Large number (1e5) of small groups, multiple arrgegates GroupBy benchmark notebook
  • 11. 11 WAIT A MINUTE… • Pandas is single-threaded, but there is Dask • cuDF is a single GPU solution
  • 12. 12 WAIT A MINUTE… ddf = dask.dataframe.from_pandas(pdf, npartitions=8) %%timeit -r 3 -n 3 pdf.groupby(['id1','id2']).agg({'v1':'sum’}) 1.79 s ± 10.7 ms per loop (mean ± std. dev. of 3 runs, 3 loops each) %%timeit -r 3 -n 3 ddf.groupby(["id1", "id2"]).agg({'v1': 'sum'}).compute() 1.34 s ± 33.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each) %%timeit -r 3 -n 3 cdf.groupby(['id1','id2']).agg({'v1':'sum’}) 37.5 ms ± 14.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each) DASK DataFrame execution
  • 15.
      Category | Algorithm | Notes
      ---------|-----------|------
      Clustering | Density-Based Spatial Clustering of Applications with Noise (DBSCAN) |
                 | K-Means | Multi-node multi-GPU via Dask
      Dimensionality Reduction | Principal Components Analysis (PCA) | Multi-node multi-GPU via Dask
                 | Truncated Singular Value Decomposition (tSVD) | Multi-node multi-GPU via Dask
                 | Uniform Manifold Approximation and Projection (UMAP) |
                 | Random Projection |
                 | t-Distributed Stochastic Neighbor Embedding (TSNE) |
      Linear Models for Regression or Classification | Linear Regression (OLS) |
                 | Linear Regression with Lasso or Ridge Regularization |
                 | ElasticNet Regression |
                 | Logistic Regression |
                 | Stochastic Gradient Descent (SGD), Coordinate Descent (CD), and Quasi-Newton (QN) (including L-BFGS and OWL-QN) solvers for linear models |
  • 16.
      Category | Algorithm | Notes
      ---------|-----------|------
      Nonlinear Models for Regression or Classification | Random Forest (RF) Classification | Experimental multi-node multi-GPU via Dask
                 | Random Forest (RF) Regression | Experimental multi-node multi-GPU via Dask
                 | Inference for decision tree-based models | Forest Inference Library (FIL)
                 | K-Nearest Neighbors (KNN) | Multi-node multi-GPU via Dask, uses Faiss for Nearest Neighbors Query
                 | K-Nearest Neighbors (KNN) Classification |
                 | K-Nearest Neighbors (KNN) Regression |
                 | Support Vector Machine Classifier (SVC) |
                 | Epsilon-Support Vector Regression (SVR) |
      Time Series | Linear Kalman Filter |
                 | Holt-Winters Exponential Smoothing |
                 | Auto-regressive Integrated Moving Average (ARIMA) |
  • 18. START DASK CLUSTER
      from dask.distributed import Client
      from dask_cuda import LocalCUDACluster

      # This will use all GPUs on the local host by default
      cluster = LocalCUDACluster(threads_per_worker=1)
      c = Client(cluster)

      # Query the client for all connected workers
      workers = c.has_what().keys()
      n_workers = len(workers)
      n_streams = 8  # Performance optimization
  • 19. GENERATE DATA
      import numpy as np
      from sklearn import datasets, model_selection

      # Data parameters
      train_size = int(1e6)
      test_size = int(1e3)
      n_samples = train_size + test_size
      n_features = 20

      X, y = datasets.make_classification(n_samples=n_samples,
                                          n_features=n_features,
                                          n_clusters_per_class=1,
                                          n_informative=int(n_features / 3),
                                          random_state=123,
                                          n_classes=5)
      y = y.astype(np.int32)
      X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size)
  • 20. DISTRIBUTE DATA TO GPUS
      import cudf
      import dask_cudf
      import pandas as pd
      from cuml.dask.common import utils as dask_utils

      n_partitions = n_workers

      # First convert to cuDF (with real data, you would likely load in cuDF format to start)
      X_train_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X_train))
      y_train_cudf = cudf.Series(y_train)

      # Partition with Dask
      # In this case, each worker will train on 1/n_partitions fraction of the data
      X_train_dask = dask_cudf.from_cudf(X_train_cudf, npartitions=n_partitions)
      y_train_dask = dask_cudf.from_cudf(y_train_cudf, npartitions=n_partitions)

      # Persist to cache the data in active memory
      X_train_dask, y_train_dask = dask_utils.persist_across_workers(
          c, [X_train_dask, y_train_dask], workers=workers)
  • 22. BUILD A SCIKIT-LEARN MODEL
      from sklearn.ensemble import RandomForestClassifier as sklRF

      # Random Forest building parameters
      max_depth = 12
      n_bins = 16
      n_trees = 1000

      %%time
      # Use all available CPU cores
      skl_model = sklRF(max_depth=max_depth, n_estimators=n_trees, n_jobs=-1)
      skl_model.fit(X_train, y_train)

      CPU times: user 3h 3min 18s, sys: 32.3 s, total: 3h 3min 51s
      Wall time: 2min 27s
  • 24. BUILD DISTRIBUTED CUML MODEL
      from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF
      from dask.distributed import wait

      # Random Forest building parameters
      max_depth = 12
      n_bins = 16
      n_trees = 1000

      %%time
      cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees,
                              n_bins=n_bins, n_streams=n_streams)
      cuml_model.fit(X_train_dask, y_train_dask)
      wait(cuml_model.rfs)  # Allow asynchronous training tasks to finish

      CPU times: user 133 ms, sys: 24.4 ms, total: 157 ms
      Wall time: 1.93 s
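The two `%%time` outputs (scikit-learn: wall time 2min 27s, total CPU time 3h 3min 51s; cuML: wall time 1.93 s) can be compared directly. The sketch below only does the unit conversion and division; the helper `to_seconds` is introduced here for illustration.

```python
def to_seconds(hours=0, minutes=0, seconds=0.0):
    """Convert an h/min/s reading from %%time into seconds."""
    return hours * 3600 + minutes * 60 + seconds

# Figures reported on the slides
skl_wall = to_seconds(minutes=2, seconds=27)           # 147 s
skl_cpu = to_seconds(hours=3, minutes=3, seconds=51)   # 11031 s
cuml_wall = 1.93                                       # s

# How much sooner the cuML fit finished, by wall-clock time
wall_speedup = skl_wall / cuml_wall

# CPU time / wall time: average number of busy CPU cores during the sklearn fit
effective_cores = skl_cpu / skl_wall

print(f"wall-clock speedup: {wall_speedup:.0f}x")        # 76x
print(f"average busy CPU cores: {effective_cores:.0f}")  # 75
```

The ratio of CPU time to wall time suggests the scikit-learn run kept roughly 75 cores busy on average, so the comparison is against a heavily parallel CPU run rather than a single thread.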
  • 25. PREDICT AND CHECK ACCURACY
      from sklearn.metrics import accuracy_score

      skl_y_pred = skl_model.predict(X_test)
      cuml_y_pred = cuml_model.predict(X_test)

      # Due to randomness in the algorithm, you may see slight variation in accuracies
      print("SKLearn accuracy: ", accuracy_score(y_test, skl_y_pred))
      print("CuML accuracy: ", accuracy_score(y_test, cuml_y_pred))

      SKLearn accuracy: 0.899
      CuML accuracy: 0.886

      See: Random Forest SNMG demo
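For readers unfamiliar with the metric, accuracy is simply the fraction of test samples whose predicted label equals the true label. Below is a minimal stdlib sketch of that definition (the function `accuracy` and the toy labels are illustrative, not the scikit-learn implementation), followed by what the reported gap means on the 1000-sample test set.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels from 5 classes; two of the eight predictions are wrong
y_true = [0, 1, 2, 2, 4, 3, 1, 0]
y_pred = [0, 1, 2, 1, 4, 3, 1, 2]
toy_accuracy = accuracy(y_true, y_pred)
print(toy_accuracy)  # 0.75

# Accuracies reported on the slide, with test_size = 1000
test_size = 1000
gap_in_samples = round((0.899 - 0.886) * test_size)
print(gap_in_samples)  # 13
```

So the difference between the two models on this run is 13 test samples out of 1000, which is in line with the slide's remark about run-to-run variation.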
  • 27. YES!
      • Still pretty immature and not ready for production
          • Especially Dask
      • Porting UDFs is hard [1, 2]
      • No CPU version (even for inference)
      • No automatic memory management
          • For obvious reasons
      1. Apply Operations in cuDF
      2. Numba cuDF integration
  • 29.
      Component | 2010 | 2016 | 2019 | Scale factor
      ----------|------|------|------|-------------
      Storage | 50 MB/s (HDD) | 500 MB/s (SATA SSD) | 2 GB/s (NVMe SSD) | 40x
      Network | 1 Gbit/s | 10 Gbit/s | 40 Gbit/s | 40x
      CPU | 500 GFLOPS | 1 200 GFLOPS | 3 000 GFLOPS (18 cores/avx512) | 6x
      CPU mem | 40 GB/s | 80 GB/s | 125 GB/s | 3x
      GPU | 1 300 GFLOPS | 6 000 GFLOPS | 15 000 GFLOPS | 12x
      GPU mem | 150 GB/s | 480 GB/s | 900 GB/s | 6x
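The "Scale factor" column is the 2019 value divided by the 2010 value, rounded. A short sketch that recomputes it from the table (units are normalised so both years use the same unit; the variable names are chosen for this example):

```python
# (2010 value, 2019 value) per row, in the same unit for both years
rows = {
    "Storage (MB/s)": (50, 2000),      # 2 GB/s = 2000 MB/s
    "Network (Gbit/s)": (1, 40),
    "CPU (GFLOPS)": (500, 3000),
    "CPU mem (GB/s)": (40, 125),
    "GPU (GFLOPS)": (1300, 15000),
    "GPU mem (GB/s)": (150, 900),
}

scale = {name: v2019 / v2010 for name, (v2010, v2019) in rows.items()}

for name, factor in scale.items():
    print(f"{name}: {factor:.1f}x")
# Storage (MB/s): 40.0x
# Network (Gbit/s): 40.0x
# CPU (GFLOPS): 6.0x
# CPU mem (GB/s): 3.1x
# GPU (GFLOPS): 11.5x
# GPU mem (GB/s): 6.0x
```

Storage and network throughput grew by about 40x over the period, while CPU compute and CPU memory bandwidth grew by 6x and about 3x.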
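  • 30.
      Device | Performance, GFLOPS | Memory bandwidth, GB/s | TDP, W | Price, $
      -------|---------------------|------------------------|--------|---------
      Nvidia Tesla T4 | 8 100 | 320 | 75 | 3 000
      Intel® Xeon® Gold 6140 | 2 500 | 120 | 140 | 3 000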
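One way to read this table is in terms of efficiency: GFLOPS per watt and GFLOPS per dollar. The sketch below derives both from the figures on the slide; the dictionary layout and key names are chosen for this example.

```python
# Figures from the slide
devices = {
    "Nvidia Tesla T4": {"gflops": 8100, "tdp_w": 75, "price_usd": 3000},
    "Intel Xeon Gold 6140": {"gflops": 2500, "tdp_w": 140, "price_usd": 3000},
}

efficiency = {
    name: {
        "gflops_per_watt": d["gflops"] / d["tdp_w"],
        "gflops_per_usd": d["gflops"] / d["price_usd"],
    }
    for name, d in devices.items()
}

for name, e in efficiency.items():
    print(f"{name}: {e['gflops_per_watt']:.1f} GFLOPS/W, "
          f"{e['gflops_per_usd']:.2f} GFLOPS/$")
# Nvidia Tesla T4: 108.0 GFLOPS/W, 2.70 GFLOPS/$
# Intel Xeon Gold 6140: 17.9 GFLOPS/W, 0.83 GFLOPS/$
```

On these numbers the T4 delivers about 6x the compute per watt and a little over 3x the compute per dollar.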
  • 31. GPU VS CPU ARCHITECTURE
  • 32. GPU TAKEAWAYS
      1. GPU memory bus is ~7x wider than CPU
      2. GPU has thousands of “simple” ALUs
      3. GPU is a peripheral device
          1. CPU needs to launch a CUDA kernel on the GPU
          2. GPU connects to CPU via PCI Express
  • 33. CPU TO GPU IS SLOW!
      [Diagram: DRAM, CPU, GPU (Tesla V100), GPU DRAM]
      • DRAM to CPU: DDR4, 4 channels, 60 GB/s
      • CPU to GPU: PCIe v4 x16, 32 GB/s
      • GPU to GPU DRAM: HBM2, 900 GB/s
      • 30x performance drop
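The "30x" figure follows from the bandwidths in the diagram: 900 GB/s inside the GPU versus 32 GB/s across PCIe. The sketch below recomputes the ratio and estimates how long one pass over a dataset takes on each link. The 10 GB dataset size is a hypothetical value, and using peak bandwidth as the sustained rate is an assumption.

```python
# Bandwidths from the slide, in GB/s
bandwidth_gb_s = {
    "DRAM <-> CPU (DDR4, 4 channels)": 60.0,
    "CPU <-> GPU (PCIe x16)": 32.0,
    "GPU <-> GPU DRAM (HBM2)": 900.0,
}

size_gb = 10.0  # hypothetical dataset size

# Time to stream the dataset once over each link, assuming peak bandwidth
seconds = {link: size_gb / bw for link, bw in bandwidth_gb_s.items()}

for link, t in seconds.items():
    print(f"{link}: {t * 1000:.1f} ms")

# Bandwidth drop when data has to cross PCIe instead of staying in GPU memory
drop = bandwidth_gb_s["GPU <-> GPU DRAM (HBM2)"] / bandwidth_gb_s["CPU <-> GPU (PCIe x16)"]
print(f"drop: {drop:.1f}x")  # drop: 28.1x
```

Under these assumptions, moving 10 GB over PCIe takes about 312 ms, while reading the same data from GPU memory takes about 11 ms.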
  • 34. GPU BEST PRACTICE
      1. Data must not leave GPU memory!
      2. You will get a performance boost only if your dataset is big enough to keep the GPU busy
      3. Use Apache Arrow compatible formats (e.g. Parquet)
      4. Keep an eye on GPUDirect Storage and similar technologies
      5. CUDA is different from what you’re used to. Accept it and make use of it!
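The first rule can be illustrated with a rough cost model. The sketch below assumes a purely memory-bandwidth-bound workload, uses the peak bandwidths quoted earlier in the deck (60 GB/s CPU memory, 32 GB/s PCIe, 900 GB/s GPU memory) as sustained rates, and ignores kernel launch overhead and any overlap of copy and compute. All function names are made up for this example.

```python
# Bandwidths quoted in the deck, in GB/s (treated as sustained rates: an assumption)
CPU_MEM_BW = 60.0
PCIE_BW = 32.0
GPU_MEM_BW = 900.0

def cpu_seconds(size_gb, passes):
    """Time for `passes` full scans of the data from CPU memory."""
    return passes * size_gb / CPU_MEM_BW

def gpu_seconds(size_gb, passes, copies):
    """Time for `copies` host-to-device transfers plus `passes` scans on the GPU."""
    return copies * size_gb / PCIE_BW + passes * size_gb / GPU_MEM_BW

size_gb = 10.0  # hypothetical dataset size
for passes in (1, 2, 3, 10):
    cpu = cpu_seconds(size_gb, passes)
    resident = gpu_seconds(size_gb, passes, copies=1)      # copy once, keep on GPU
    recopied = gpu_seconds(size_gb, passes, copies=passes)  # copy before every pass
    print(f"passes={passes:2d}  cpu={cpu:.3f}s  "
          f"gpu(copy once)={resident:.3f}s  gpu(copy each pass)={recopied:.3f}s")

# Number of passes at which a single copy to the GPU pays for itself
break_even_passes = (1 / PCIE_BW) / (1 / CPU_MEM_BW - 1 / GPU_MEM_BW)
print(f"break-even after {break_even_passes:.2f} passes")
```

In this model a single copy pays off from the third pass over the data onward, while re-copying before every pass is always slower than staying on the CPU, because the PCIe transfer alone takes longer than a scan from CPU memory.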
  • 35. USEFUL LINKS
      • RAPIDS
      • RAPIDS DOCS
      • rapids-nightly dockerhub (use it except for production)
      • RAPIDS Notebooks
      • RAPIDS Contributed Notebooks
      • kNN 600x speedup on MNIST (Kaggle notebook)
      • Multi-GPU XGBoost with RAPIDS
      • Dmitry Ursegov presentation for Moscow Spark #7
      • Numba for CUDA GPUs
      • PyCUDA