GPU + Deep learning
Basics & best practices
Lior Sidi
DMBI Datahack – May 2019
About Me
Data scientist & More 
Agenda
• Deep learning basics
• GPU Basics
• CUDA
• Multi-GPU
• GPU Configurations
• IDE Configuration
• Spark for inference
• GPU Demo
Slides are based on..
• My experience
• Machine learning course - Prof Lior Rokach
• Stanford CS231
• NVIDIA.com
• Tensorflow.org
Deep learning basics
Recap
(Figures: Error-Back-Propagation, Baharvand, Ahmadi, Rahaie)
• Training starts through the input layer:
• The same happens for y2 and y3.
Intuition by Illustration
• Propagation of signals through the hidden layer:
• The same happens for y5.
Intuition by Illustration
• Propagation of signals through the output layer:
Intuition by Illustration
• Error signal of output layer neuron:
Intuition by Illustration
• Propagate the error signal back to all neurons.
Intuition by Illustration
• If the propagated errors come from several neurons, they are added:
• The same happens for neuron-2 and neuron-3.
Intuition by Illustration
• Weight updating starts:
• The same happens for all neurons.
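To make the error-signal and weight-update steps concrete, here is a minimal, illustrative NumPy sketch of one gradient step for a single sigmoid output neuron (not taken from the original slides; the squared-error loss, names, and learning rate are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(w, x, target, lr=0.1):
    # forward pass for one neuron
    y = sigmoid(np.dot(w, x))
    # error signal at the output neuron (squared-error loss, sigmoid derivative)
    delta = (y - target) * y * (1.0 - y)
    # gradient w.r.t. each incoming weight, followed by the weight update
    return w - lr * delta * x

w = backprop_step(np.array([0.5, -0.3]), np.array([1.0, 2.0]), target=1.0)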
Tensorboard
https://www.tensorflow.org/guide/graphs
Training Flow
[Diagram: a training machine samples from the dataset and updates the model parameters.]
How can we make the training faster?
Training Flow (with batches)
[Diagram: the training machine draws a batch of samples from the dataset and updates the model parameters.]
How can we make the training faster? Compute the gradients on a batch of samples (sketch below).
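As a minimal illustration of computing the gradient on a batch of samples, here is a NumPy sketch for a hypothetical linear model with squared error (illustrative only, not from the slides):

import numpy as np

def batch_gradient_step(w, X_batch, y_batch, lr=0.01):
    y_hat = X_batch.dot(w)                       # forward pass on the whole batch
    error = y_hat - y_batch                      # per-sample error
    grad = X_batch.T.dot(error) / len(X_batch)   # gradient averaged over the batch
    return w - lr * grad                         # one weight update per batch

# usage: w = batch_gradient_step(w, X[i:i + 32], y[i:i + 32])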
But How?
GPU basics
This image is licensed under CC-BY 2.0
Spot the CPU!
(central processing unit)
http://cs231n.stanford.edu/
Spot the GPUs!
(graphics processing unit)
This image is in the public domain
http://cs231n.stanford.edu/
CPU / GPU Communication
[Diagram: the model lives on the GPU; the data lives on disk / in RAM on the CPU side.]
If you aren’t careful, training can bottleneck on reading data and transferring it to the GPU!
Solutions:
- Read all data into RAM
- Use an SSD instead of an HDD
- Use multiple CPU threads to prefetch data (see the tf.data sketch below)
http://cs231n.stanford.edu/
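A minimal sketch of the prefetching idea with the tf.data API (assuming TensorFlow 1.13+ as used later in the deck; the arrays are toy stand-ins for data already loaded into RAM):

import numpy as np
import tensorflow as tf

# Toy in-memory data standing in for a dataset already loaded into RAM.
features = np.random.rand(1000, 32).astype('float32')
labels = np.random.randint(0, 10, size=1000)

# The pipeline prepares the next batches on CPU threads while the GPU trains.
dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(buffer_size=1000)
           .batch(32)
           .prefetch(2))   # keep 2 batches ready ahead of the GPU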
CPU vs GPU
• Cores: few, very complex (CPU) vs. hundreds, simple (GPU)
• Instructions: different per core (CPU) vs. same across cores (GPU)
• Management: operating system (CPU) vs. hardware (GPU)
• Operations: serial (CPU) vs. parallel (GPU)
• CPU: low latency (time to complete a task); GPU: high throughput (number of tasks per unit time)
CPU vs GPU
*A teraflop refers to the capability of a processor to calculate one trillion floating-point operations per second.
http://cs231n.stanford.edu/
CPU vs GPU
[Charts: CPU vs GPU performance comparison.]
http://cs231n.stanford.edu/
Data Center vs. AI Lab
http://christopher5106.github.io/big/data/2015/07/31/deep-learning-machine-gpu-accelerated-computing-versus-cluster.html
Other cool DL hardware (we won’t cover)
FPGA - Field Programmable Gate Array
• Optimized for inference
• Power efficiency
• Flexible hardware architecture
• Functional safety
https://www.aldec.com/en/company/blog/167--fpgas-vs-gpus-for-machine-learning-applications-which-one-is-better
CUDA
CUDA
• CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA
• A software layer that gives direct access to the GPU
CUDA – processing flow
cuDNN - CUDA Deep Neural Network library
• A GPU-accelerated library for deep neural networks.
• Provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
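As a quick sanity check that work is actually running on the GPU through the CUDA stack, here is a minimal TensorFlow 1.x sketch (illustrative only; assumes the tensorflow-gpu setup described later in the deck):

import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    c = tf.matmul(a, b)  # runs on the GPU via the CUDA libraries (cuBLAS here)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)  # the placement log shows which ops landed on the GPU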
CPU vs GPU in practice
(CPU performance not well-optimized, a little unfair)
66x 67x 71x 64x 76x
Data from https://github.com/jcjohnson/cnn-benchmarks
CPU vs GPU in practice
cuDNN much faster than “unoptimized” CUDA
2.8x 3.0x 3.1x 3.4x 2.8x
Data from https://github.com/jcjohnson/cnn-benchmarks
Multi-GPU
The Need for Distributed Training
• Larger and deeper models are being proposed: AlexNet to ResNet to NMT
– DNNs require a lot of memory
– Larger models cannot fit in a single GPU’s memory
• Single-GPU training became a bottleneck
• As mentioned earlier, the community has already moved to multi-GPU training
• Multi-GPU in one node is good, but there is a limit to scale-up (8 GPUs)
• Multi-node (distributed or parallel) training is necessary!
Comparing complexity...
An Analysis of Deep Neural Network Models for Practical Applications, 2017.
8/6/2017: Facebook managed to reduce the training time of a ResNet-50 deep learning model on ImageNet from 29 hours to one hour. Instead of using batches of 256 images with eight GPUs, they used batch sizes of 8,192 images distributed across 256 GPUs.
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017.
Parallelism Types
Model parallelism
Different machines in the distributed system are responsible for the computations in different parts of a single network. For example, each layer in the neural network may be assigned to a different machine.
Data parallelism
Different machines have a complete copy of the model; each machine simply gets a different portion of the data, and the results from each are somehow combined.
Hybrid Model
Combination of Parallelization Strategies (Network-Based Computing Laboratory, Hot Interconnects ’17)
Courtesy: http://on-demand.gputechconf.com/gtc/2017/presentation/s7724-minjie-wong-tofu-parallelizing-deep-learning.pdf
Data Parallelism
• Data parallel approaches to distributed training keep a copy of the
entire model on each worker machine, processing different subsets of
the training data set on each.
• Data parallel training approaches all require some method of combining results and synchronizing the model parameters across the workers
• Approaches:
• Parameter averaging vs. update (gradient)-based approaches
• Synchronous vs. asynchronous methods
• Centralized vs. distributed synchronization
Parameter Averaging
• Parameter averaging is the conceptually simplest approach to data parallelism. With parameter averaging, training proceeds as follows (a minimal sketch of the averaging step follows the list):
1. Initialize the network parameters randomly based on the model configuration
2. Distribute a copy of the current parameters to each worker
3. Train each worker on a subset of the data
4. Set the global parameters to the average of the parameters from each worker
5. While there is more data to process, go to step 2
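A minimal NumPy sketch of the averaging step (step 4), assuming every worker trains an identical architecture:

import numpy as np

def average_parameters(worker_weights):
    # worker_weights: one list of weight arrays per worker, all with the same shapes
    return [np.mean(np.stack(layer_versions), axis=0)
            for layer_versions in zip(*worker_weights)]

# usage with Keras models:
# model.set_weights(average_parameters([m.get_weights() for m in workers]))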
Parameter Averaging
Multi GPU - Data Parallelism on Keras!
https://keras.io/utils/#multi_gpu_model
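A minimal sketch of the Keras 2.x multi_gpu_model API linked above (the toy model and the GPU count are illustrative):

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# Build the template model (kept on the CPU), then replicate it on 2 GPUs.
model = Sequential([Dense(64, activation='relu', input_dim=100),
                    Dense(1, activation='sigmoid')])
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam', loss='binary_crossentropy')

# Each batch is split into 2 sub-batches, one per GPU, and the gradients are merged.
# parallel_model.fit(X, y, batch_size=256, epochs=10)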
Asynchronous Stochastic Gradient Descent
• An ‘update-based’ form of data parallelism.
• The primary difference from parameter averaging is that instead of transferring parameters from the workers to the parameter server, we transfer the updates (i.e., the gradients after applying the learning rate, momentum, etc.).
Asynchronous Stochastic Gradient Descent
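A toy, framework-free sketch of the update-based idea: workers push gradients (not parameters) and the parameter server applies them as they arrive (purely illustrative; real systems such as distributed TensorFlow or Horovod work differently under the hood):

import numpy as np

class ParameterServer(object):
    def __init__(self, weights, lr=0.01):
        self.weights = [w.copy() for w in weights]
        self.lr = lr

    def push_update(self, gradients):
        # called asynchronously by any worker as soon as its gradients are ready
        self.weights = [w - self.lr * g for w, g in zip(self.weights, gradients)]

    def pull_weights(self):
        return [w.copy() for w in self.weights]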
When to Use Distributed Deep Learning?
GPU Configuration
Common GPU stack for Data science
• PyCharm IDE
• High-level API
• Library
• Platform / backend
• Driver
• Hardware
Challenges
• Linux
• SUDO
• Many packages
• Env path
• GPU has strong CPU 
GPU configuration
1. Tensorflow GPU
• pip install tensorflow-gpu
2. Software requirements - Install CUDA with apt
• NVIDIA® GPU drivers — CUDA 10.0 requires 410.x or higher
• CUDA® Toolkit — TensorFlow supports CUDA 10.0 (TensorFlow >= 1.13.0)
• CUPTI — ships with the CUDA Toolkit
• cuDNN SDK (>= 7.4.1)
https://www.tensorflow.org/install/gpu
GPU configuration
# Add NVIDIA package repositories
# Add HTTPS support for apt-key
sudo apt-get install gnupg-curl
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt-get update
# Install NVIDIA Driver
# Issue with driver install requires creating /usr/lib/nvidia
sudo mkdir /usr/lib/nvidia
sudo apt-get install --no-install-recommends nvidia-410
# Reboot. Check that GPUs are visible using the command: nvidia-smi
# Install development and runtime libraries (~4GB)
sudo apt-get install --no-install-recommends \
    cuda-10-0 \
    libcudnn7=7.4.1.5-1+cuda10.0 \
    libcudnn7-dev=7.4.1.5-1+cuda10.0
# Install TensorRT. Requires that libcudnn7 is installed above.
sudo apt-get update && \
    sudo apt-get install nvinfer-runtime-trt-repo-ubuntu1604-5.0.2-ga-cuda10.0 \
    && sudo apt-get update \
    && sudo apt-get install -y --no-install-recommends libnvinfer-dev=5.0.2-1+cuda10.0
https://www.tensorflow.org/install/gpu
NVIDIA System Management Interface
• The NVIDIA System Management Interface (nvidia-smi) is a command
line utility, based on top of the NVIDIA Management Library (NVML),
intended to aid in the management and monitoring of NVIDIA GPU
devices.
• This utility allows administrators to query GPU device state and, with the appropriate privileges, to modify GPU device state. It is targeted at the Tesla™, GRID™, Quadro™ and Titan X products, though limited support is also available on other NVIDIA GPUs.
https://developer.nvidia.com/nvidia-system-management-interface
NVIDIA System Management Interface
watch --interval 1 nvidia-smi
MobaXterm
Code Validation
import tensorflow as tf
print(tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None))

from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
print(local_device_protos)
https://github.com/liorsidi/GPU_deep_demo
IDE Configuration
1. Remote interpreter
2. Remote interpreter configuration
3. Remote interpreter folders
Now wait
TensorFlow on Spark
For inference
Training vs Inference
Inference - Spark
1. Install TensorFlow & Keras on each node
2. Train a model on the GPU
3. Save the model as an H5 file
4. Define the batch size based on executor memory size & network size
5. Load the saved model on each node in the cluster
6. Run the code:
1. Based on RDDs
2. Use mapPartitions to call the executor code:
1. Load the model
2. predict_on_batch
Inference – Spark Code
import pandas as pd
from keras.models import load_model, Sequential
from pyspark.sql.types import Row

def keras_spark_predict(model_path, weights_path, partition):
    # model_path / weights_path are Spark broadcast variables holding the
    # model config and weights (not file paths)
    model = Sequential.from_config(model_path.value)
    model.set_weights(weights_path.value)
    # Create a DataFrame containing the partition's features
    features_list = list(map(lambda x: [x[:]], partition))
    features_df = pd.DataFrame(features_list)
    # predict with the Keras model
    predictions = model.predict_on_batch(features_df)
    predictions_return = map(lambda prediction: Row(prediction=prediction[0].item()), predictions)
    return iter(predictions_return)

rdd = rdd.mapPartitions(lambda partition: keras_spark_predict(model_path, weights_path, partition))
https://github.com/liorsidi/GPU_deep_demo
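For completeness, a hedged sketch of how the two broadcast variables used above could be prepared on the driver. Despite their names, model_path and weights_path hold the model config and weights rather than file paths; 'model.h5' is a placeholder, and the saved model is assumed to be a Sequential model (to match Sequential.from_config above):

from pyspark import SparkContext
from keras.models import load_model

sc = SparkContext.getOrCreate()                      # or reuse the existing SparkContext
trained = load_model('model.h5')                     # the H5 file saved after GPU training
model_path = sc.broadcast(trained.get_config())      # architecture config for Sequential.from_config
weights_path = sc.broadcast(trained.get_weights())   # list of NumPy weight arrays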
Keep in mind other newer approaches
• Spark
• sparkflow
• TensorFlowOnSpark
• spark-deep-learning
GPU DEMO
Demo
https://github.com/geifmany/keras_imagenet_training/blob/master/imagenet.py
https://tiny-imagenet.herokuapp.com/
https://github.com/liorsidi/GPU_deep_demo
imatge-upc.github.io/telecombcn-2016-dlcv/slides/D2L1-memory.pdf
Batch sizing
• Batch size
• Inference: memory ≈ network parameters
• Fully connected layer parameters = #outputs x #inputs (weights) + #outputs (bias)
• In Keras:
• model.summary()
• trainable_weights
• non_trainable_weights
• (+ data generators)
A worked example follows below.
https://towardsdatascience.com/understanding-and-calculating-the-number-of-parameters-in-convolution-neural-networks-cnns-fc88790d530d
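A small Keras example of the fully connected parameter formula, checked against count_params() (the layer sizes are arbitrary):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(10, input_dim=100, activation='relu')])
model.summary()

# outputs x inputs (weights) + outputs (bias) = 10*100 + 10 = 1010
assert model.count_params() == 10 * 100 + 10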
Back to the Demo
# https://stackoverflow.com/questions/43137288/how-to-determine-needed-memory-of-keras-model
def get_model_memory_usage(batch_size, model):
    import numpy as np
    from keras import backend as K

    shapes_mem_count = 0
    for l in model.layers:
        single_layer_mem = 1
        for s in l.output_shape:
            if s is None:
                continue
            single_layer_mem *= s
        shapes_mem_count += single_layer_mem

    trainable_count = np.sum([K.count_params(p) for p in set(model.trainable_weights)])
    non_trainable_count = np.sum([K.count_params(p) for p in set(model.non_trainable_weights)])

    number_size = 4.0
    if K.floatx() == 'float16':
        number_size = 2.0
    if K.floatx() == 'float64':
        number_size = 8.0

    total_memory = number_size * (batch_size * shapes_mem_count + trainable_count + non_trainable_count)
    gbytes = np.round(total_memory / (1024.0 ** 3), 3)
    return gbytes
https://github.com/liorsidi/GPU_deep_demo
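A possible usage of the helper above with a small illustrative CNN (any Keras model can be passed in; the architecture here is arbitrary):

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

model = Sequential([
    Conv2D(64, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    Flatten(),
    Dense(10, activation='softmax'),
])
print(get_model_memory_usage(batch_size=32, model=model))  # rough GPU memory estimate in GB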
To summarize
• GPUs are awesome
• Mind the batch size
• Monitor your GPU (re-validate after every TensorFlow software update)
• Work with PyCharm’s remote interpreter
• Separate training from inference
• Consider using a free cloud tier
• Fast.ai
Thank you & Good luck!
Tips for winning data hackathons
• Separate roles:
• Domain expert – explore the data, define features, read papers, choose metrics
• Data engineer – preprocess data, extract features, build the evaluation pipeline
• Data scientist – algorithm development, evaluation, hyperparameter tuning
• Evaluation – avoid overfitting; someone is trying to trick you
• Be consistent with your plan and feature exploration
• Limited data:
• Augmentation
• Extreme regularization
• Creativity:
• Think out of the box
• Use state-of-the-art tools
• Save time and rest