SlideShare une entreprise Scribd logo
1  sur  24
Télécharger pour lire hors ligne
CUDA and Caffe for 
deep learning 
Amgad Muhammad 
Mohamed Ghoneim
Outline 
• GPU Computing 
• What is CUDA? 
• Why use CUDA? 
• When use CUDA? 
• CUDA - Machine Specs . 
• CUDA - Matrix Multiplication 
• CUDA - Closest Pair in 2D 
• Convolution Neural Networks 
• Auto Encoder
GPU Computing 
• Moore’s law slowed down. 
• Computation is directed towards parallelism instead of 
better processing unit performance. 
• CPU has a small number of processing units with very 
high processing power. 
• GPU has a large number of processing units with 
moderate processing power.
What is CUDA? 
• Compute Unified Device Architecture 
• Introduced by nVidia in 2006. 
• Refers to 2 different concepts: 
1. CUDA Architecture: Massively parallel architecture of 
modern GPUs with hundreds of cores. 
2. CUDA Programming Model: the model used to program 
these GPUs
[Bryan Catanzaro]
Why use CUDA? 
• Efficiently processing thousands of small/repeated tasks 
in parallel. 
• It provides a methodology for these tasks to 
communicate and cooperate efficiently. 
• Scalable and intuitive mechanism to express 
parallelism.
When use CUDA? 
• Lots of computations and lots of data. 
• Parallel algorithms. 
• Neural Networks. 
• Physical Simulations 
• Distributed Computing 
• Accelerated Encryption, Decryption and Compression
CUDA – Machine Specs . 
Machine specs for this experiment: 
- Processor: Dual-core AMD Opteron(™) processor 2216 2.4 GHz (2 
processors). 
- RAM: 32.0 GB 
- OS: 64-bit Windows 7 
- Graphics Card: Quadro FX 4600 
- CUDA Driver: 5.5 
- CUDA Compatibility: 1.0 
- # of Cores: 96 
- Core Clock: 500MHz 
- Memory: 768MB 
- Memory Clock: 1400MHz
CUDA - Matrix Multiplication 
Comparing different implementations: 
All the times below are in milliseconds. 
100 200 300 400 500 600 700 800 900 1000 
25000 
20000 
15000 
10000 
5000 
0 
Matrix Multiplication 
Matrix Side 
CPU GPU 
Time in MS
CUDA - Closest Pair in 2D 
This is a well known problem where the 
algorithm tries to find the 2 points that 
closest to each other. There are many 
solutions to address this problem: 
1. Brute Force complexity O( n^2 ) 
2. Divide and Conquer O( n log(n) ) 
For completeness there is another implementation using KD-trees with complexity similar to D&C.
CUDA - Closest Pair in 2D (cont.) 
Comparing different implementations: 
All the times below are in milliseconds. 
100 1000 5000 10000 20000 25000 30000 40000 50000 100000 
250000 
200000 
150000 
100000 
50000 
0 
Closest Pair in 2D 
Number of Points 
Brute Force CPU BF GPU BF GPU Optimized 
Time in MS
CUDA - Closest Pair in 2D (cont.) 
Comparing different implementations: 
All the times below are in milliseconds. 
1000 
900 
800 
700 
600 
500 
400 
300 
200 
100 
0 
Closest Pair in 2D 
100 1000 5000 10000 20000 25000 30000 40000 
Number of Points 
BF GPU Optimized Divide and Conquer CPU 
Time in MS
CUDA - Closest Pair in 2D (cont.) 
To explain how optimized GPU version works we need to review the threads hierarchy in 
the GPU works:
CUDA - Closest Pair in 2D (cont.) 
To explain how optimized GPU version works we need to review the memory hierarchy in 
the GPU works:
CUDA – back to Matrix Multiplication 
Explaining the matrix multiplication optimization on board
CUDA - Closest Pair in 2D (cont.) 
Explaining the optimized code on board 
__global__ void FindClosestGPU2(float2* points, float* vals, int count) 
{ 
__shared__ float2 sharedPoints[blockSize]; 
if(count <= 1) return; 
int idx = threadIdx.x + blockIdx.x * blockDim.x; 
float2 thisPoint; 
float distanceToClosest = FLT_MAX; 
if(idx < count) thisPoint = points[idx]; 
for(int currentBlockOfPoints = 0; currentBlockOfPoints < gridDim.x; currentBlockOfPoints++) { 
if(threadIdx.x + currentBlockOfPoints * blockSize < count) 
sharedPoints[threadIdx.x] = points[threadIdx.x + currentBlockOfPoints * blockSize]; 
else 
sharedPoints[threadIdx.x].x = reasonableINF, sharedPoints[threadIdx.x].y = reasonableINF; 
__syncthreads(); 
if(idx < count) { 
float *ptr = &sharedPoints[0].x; 
for(int i = 0; i < blockSize; i++) { 
float dist = (thisPoint.x - ptr[0]) * (thisPoint.x - ptr[0]) + 
(thisPoint.y - ptr[1]) * (thisPoint.y - ptr[1]); 
ptr += 2; 
if(dist < distanceToClosest && (i + currentBlockOfPoints * blockSize < count) 
&& (i + currentBlockOfPoints * blockSize != idx)) 
distanceToClosest = dist; 
} 
}_ 
_syncthreads(); 
}i 
f(idx < count) 
vals[idx] = distanceToClosest; 
}
CNN
Convolution, The first operation to optimize
Pooling, the second operation to optimize
Results
LeNet Results The MNIST database of handwritten digits, has a training set of 60,000 examples, and a test set of 10,000 examples. We used 
OpenBlas for parallelization on the CPU 
Due to the fact that the data set is small in size, the overhead wasn't compensated by the speedup. 
1 CPU Core 2 CPU Cores 3 CPU Cores 4 CPU Cores 
800 
700 
600 
500 
400 
300 
200 
100 
0 
CNN 
with GPU without GPU 
Time in Seconds
AutoEncoder
AutoEncoders Results 
The MNIST database of handwritten digits, has a training set of 60,000 examples, and a test set of 10,000 examples. And the main 
operation here is inner product 
1 CPU Core 2 CPU Cores 3 CPU Cores 
800 
700 
600 
500 
400 
300 
200 
100 
0 
Auto Encoder 
with GPU without GPU 
Time in Seconds
Thank You! 
Questions?

Contenu connexe

Tendances

PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...Preferred Networks
 
Chainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportChainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportPreferred Networks
 
[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare EventsTaegyun Jeon
 
Svm map reduce_slides
Svm map reduce_slidesSvm map reduce_slides
Svm map reduce_slidesSara Asher
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to ChainerSeiya Tokui
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017Yu-Hsun (lymanblue) Lin
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNetAI Frontiers
 
Overview of Chainer and Its Features
Overview of Chainer and Its FeaturesOverview of Chainer and Its Features
Overview of Chainer and Its FeaturesSeiya Tokui
 
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflowNVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflowNVIDIA Taiwan
 
Deep learning and its application
Deep learning and its applicationDeep learning and its application
Deep learning and its applicationSrishty Saha
 
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...Altoros
 
Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)Dongheon Lee
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developersAbdul Muneer
 
Deep Learning with PyTorch
Deep Learning with PyTorchDeep Learning with PyTorch
Deep Learning with PyTorchMayur Bhangale
 
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNet
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNetAlex Smola at AI Frontiers: Scalable Deep Learning Using MXNet
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNetAI Frontiers
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16MLconf
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorchJun Young Park
 

Tendances (20)

Chainer v3
Chainer v3Chainer v3
Chainer v3
 
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
PFN Summer Internship 2019 / Kenshin Abe: Extension of Chainer-Chemistry for ...
 
Chainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportChainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereport
 
[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events
 
Svm map reduce_slides
Svm map reduce_slidesSvm map reduce_slides
Svm map reduce_slides
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to Chainer
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
 
Overview of Chainer and Its Features
Overview of Chainer and Its FeaturesOverview of Chainer and Its Features
Overview of Chainer and Its Features
 
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflowNVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
NVIDIA 深度學習教育機構 (DLI): Image segmentation with tensorflow
 
NAS EP Algorithm
NAS EP Algorithm NAS EP Algorithm
NAS EP Algorithm
 
Deep learning and its application
Deep learning and its applicationDeep learning and its application
Deep learning and its application
 
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
Deep Learning with TensorFlow: Understanding Tensors, Computations Graphs, Im...
 
Deep Learning for AI (2)
Deep Learning for AI (2)Deep Learning for AI (2)
Deep Learning for AI (2)
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developers
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to Chainer
 
Deep Learning with PyTorch
Deep Learning with PyTorchDeep Learning with PyTorch
Deep Learning with PyTorch
 
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNet
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNetAlex Smola at AI Frontiers: Scalable Deep Learning Using MXNet
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNet
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorch
 

En vedette

Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
GPU Computing for Cognitive Robotics
GPU Computing for Cognitive RoboticsGPU Computing for Cognitive Robotics
GPU Computing for Cognitive RoboticsMartin Peniak
 
How Zalando accelerates warehouse operations with neural networks - Calvin Se...
How Zalando accelerates warehouse operations with neural networks - Calvin Se...How Zalando accelerates warehouse operations with neural networks - Calvin Se...
How Zalando accelerates warehouse operations with neural networks - Calvin Se...Dataconomy Media
 
Back-propagation Primer
Back-propagation PrimerBack-propagation Primer
Back-propagation PrimerAuro Tripathy
 
Machine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensMachine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensOscar Law
 
Face recognition using neural network
Face recognition using neural networkFace recognition using neural network
Face recognition using neural networkIndira Nayak
 
Face recognition using artificial neural network
Face recognition using artificial neural networkFace recognition using artificial neural network
Face recognition using artificial neural networkSumeet Kakani
 
Artificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSArtificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSREHMAT ULLAH
 
neural network
neural networkneural network
neural networkSTUDENT
 
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...Sergio Orts-Escolano
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkmustafa aadel
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications Ahmed_hashmi
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情Yuta Kikuchi
 
Backpropagation in Convolutional Neural Network
Backpropagation in Convolutional Neural NetworkBackpropagation in Convolutional Neural Network
Backpropagation in Convolutional Neural NetworkHiroshi Kuwajima
 
Artificial neural networks
Artificial neural networksArtificial neural networks
Artificial neural networksstellajoseph
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkDEEPASHRI HK
 
Hand Written Character Recognition Using Neural Networks
Hand Written Character Recognition Using Neural Networks Hand Written Character Recognition Using Neural Networks
Hand Written Character Recognition Using Neural Networks Chiranjeevi Adi
 

En vedette (18)

Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
Tech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDATech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDA
 
GPU Computing for Cognitive Robotics
GPU Computing for Cognitive RoboticsGPU Computing for Cognitive Robotics
GPU Computing for Cognitive Robotics
 
How Zalando accelerates warehouse operations with neural networks - Calvin Se...
How Zalando accelerates warehouse operations with neural networks - Calvin Se...How Zalando accelerates warehouse operations with neural networks - Calvin Se...
How Zalando accelerates warehouse operations with neural networks - Calvin Se...
 
Back-propagation Primer
Back-propagation PrimerBack-propagation Primer
Back-propagation Primer
 
Machine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensMachine Learning with New Hardware Challegens
Machine Learning with New Hardware Challegens
 
Face recognition using neural network
Face recognition using neural networkFace recognition using neural network
Face recognition using neural network
 
Face recognition using artificial neural network
Face recognition using artificial neural networkFace recognition using artificial neural network
Face recognition using artificial neural network
 
Artificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSArtificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKS
 
neural network
neural networkneural network
neural network
 
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...
A Three-Dimensional Representation method for Noisy Point Clouds based on Gro...
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情
 
Backpropagation in Convolutional Neural Network
Backpropagation in Convolutional Neural NetworkBackpropagation in Convolutional Neural Network
Backpropagation in Convolutional Neural Network
 
Artificial neural networks
Artificial neural networksArtificial neural networks
Artificial neural networks
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Hand Written Character Recognition Using Neural Networks
Hand Written Character Recognition Using Neural Networks Hand Written Character Recognition Using Neural Networks
Hand Written Character Recognition Using Neural Networks
 

Similaire à CUDA and Caffe for deep learning

Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to AcceleratorsDilum Bandara
 
Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Yukio Saito
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computationjtsagata
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCinside-BigData.com
 

Similaire à CUDA and Caffe for deep learning (20)

Ultra Fast SOM using CUDA
Ultra Fast SOM using CUDAUltra Fast SOM using CUDA
Ultra Fast SOM using CUDA
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 

Plus de Amgad Muhammad

Improving region based CNN object detector using bayesian optimization
Improving region based CNN object detector using bayesian optimizationImproving region based CNN object detector using bayesian optimization
Improving region based CNN object detector using bayesian optimizationAmgad Muhammad
 
Auto-Encoders and PCA, a brief psychological background
Auto-Encoders and PCA, a brief psychological backgroundAuto-Encoders and PCA, a brief psychological background
Auto-Encoders and PCA, a brief psychological backgroundAmgad Muhammad
 
Android Performance Best Practices
Android Performance Best Practices Android Performance Best Practices
Android Performance Best Practices Amgad Muhammad
 
Unsupervised Feature Learning
Unsupervised Feature LearningUnsupervised Feature Learning
Unsupervised Feature LearningAmgad Muhammad
 

Plus de Amgad Muhammad (6)

Improving region based CNN object detector using bayesian optimization
Improving region based CNN object detector using bayesian optimizationImproving region based CNN object detector using bayesian optimization
Improving region based CNN object detector using bayesian optimization
 
Auto-Encoders and PCA, a brief psychological background
Auto-Encoders and PCA, a brief psychological backgroundAuto-Encoders and PCA, a brief psychological background
Auto-Encoders and PCA, a brief psychological background
 
Android Performance Best Practices
Android Performance Best Practices Android Performance Best Practices
Android Performance Best Practices
 
Unsupervised Feature Learning
Unsupervised Feature LearningUnsupervised Feature Learning
Unsupervised Feature Learning
 
Google File System
Google File SystemGoogle File System
Google File System
 
Python
PythonPython
Python
 

Dernier

Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjurptikerjasaptiker
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制vexqp
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...gajnagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 

Dernier (20)

Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 

CUDA and Caffe for deep learning

  • 1. CUDA and Caffe for deep learning Amgad Muhammad Mohamed Ghoneim
  • 2. Outline • GPU Computing • What is CUDA? • Why use CUDA? • When use CUDA? • CUDA - Machine Specs . • CUDA - Matrix Multiplication • CUDA - Closest Pair in 2D • Convolution Neural Networks • Auto Encoder
  • 3. GPU Computing • Moore’s law slowed down. • Computation is directed towards parallelism instead of better processing unit performance. • CPU has a small number of processing units with very high processing power. • GPU has a large number of processing units with moderate processing power.
  • 4. What is CUDA? • Compute Unified Device Architecture • Introduced by nVidia in 2006. • Refers to 2 different concepts: 1. CUDA Architecture: Massively parallel architecture of modern GPUs with hundreds of cores. 2. CUDA Programming Model: the model used to program these GPUs
  • 6. Why use CUDA? • Efficiently processing thousands of small/repeated tasks in parallel. • It provides a methodology for these tasks to communicate and cooperate efficiently. • Scalable and intuitive mechanism to express parallelism.
  • 7. When use CUDA? • Lots of computations and lots of data. • Parallel algorithms. • Neural Networks. • Physical Simulations • Distributed Computing • Accelerated Encryption, Decryption and Compression
  • 8. CUDA – Machine Specs . Machine specs for this experiment: - Processor: Dual-core AMD Opteron(™) processor 2216 2.4 GHz (2 processors). - RAM: 32.0 GB - OS: 64-bit Windows 7 - Graphics Card: Quadro FX 4600 - CUDA Driver: 5.5 - CUDA Compatibility: 1.0 - # of Cores: 96 - Core Clock: 500MHz - Memory: 768MB - Memory Clock: 1400MHz
  • 9. CUDA - Matrix Multiplication Comparing different implementations: All the times below are in milliseconds. 100 200 300 400 500 600 700 800 900 1000 25000 20000 15000 10000 5000 0 Matrix Multiplication Matrix Side CPU GPU Time in MS
  • 10. CUDA - Closest Pair in 2D This is a well known problem where the algorithm tries to find the 2 points that closest to each other. There are many solutions to address this problem: 1. Brute Force complexity O( n^2 ) 2. Divide and Conquer O( n log(n) ) For completeness there is another implementation using KD-trees with complexity similar to D&C.
  • 11. CUDA - Closest Pair in 2D (cont.) Comparing different implementations: All the times below are in milliseconds. 100 1000 5000 10000 20000 25000 30000 40000 50000 100000 250000 200000 150000 100000 50000 0 Closest Pair in 2D Number of Points Brute Force CPU BF GPU BF GPU Optimized Time in MS
  • 12. CUDA - Closest Pair in 2D (cont.) Comparing different implementations: All the times below are in milliseconds. 1000 900 800 700 600 500 400 300 200 100 0 Closest Pair in 2D 100 1000 5000 10000 20000 25000 30000 40000 Number of Points BF GPU Optimized Divide and Conquer CPU Time in MS
  • 13. CUDA - Closest Pair in 2D (cont.) To explain how optimized GPU version works we need to review the threads hierarchy in the GPU works:
  • 14. CUDA - Closest Pair in 2D (cont.) To explain how optimized GPU version works we need to review the memory hierarchy in the GPU works:
  • 15. CUDA – back to Matrix Multiplication Explaining the matrix multiplication optimization on board
  • 16. CUDA - Closest Pair in 2D (cont.) Explaining the optimized code on board __global__ void FindClosestGPU2(float2* points, float* vals, int count) { __shared__ float2 sharedPoints[blockSize]; if(count <= 1) return; int idx = threadIdx.x + blockIdx.x * blockDim.x; float2 thisPoint; float distanceToClosest = FLT_MAX; if(idx < count) thisPoint = points[idx]; for(int currentBlockOfPoints = 0; currentBlockOfPoints < gridDim.x; currentBlockOfPoints++) { if(threadIdx.x + currentBlockOfPoints * blockSize < count) sharedPoints[threadIdx.x] = points[threadIdx.x + currentBlockOfPoints * blockSize]; else sharedPoints[threadIdx.x].x = reasonableINF, sharedPoints[threadIdx.x].y = reasonableINF; __syncthreads(); if(idx < count) { float *ptr = &sharedPoints[0].x; for(int i = 0; i < blockSize; i++) { float dist = (thisPoint.x - ptr[0]) * (thisPoint.x - ptr[0]) + (thisPoint.y - ptr[1]) * (thisPoint.y - ptr[1]); ptr += 2; if(dist < distanceToClosest && (i + currentBlockOfPoints * blockSize < count) && (i + currentBlockOfPoints * blockSize != idx)) distanceToClosest = dist; } }_ _syncthreads(); }i f(idx < count) vals[idx] = distanceToClosest; }
  • 17. CNN
  • 18. Convolution, The first operation to optimize
  • 19. Pooling, the second operation to optimize
  • 21. LeNet Results The MNIST database of handwritten digits, has a training set of 60,000 examples, and a test set of 10,000 examples. We used OpenBlas for parallelization on the CPU Due to the fact that the data set is small in size, the overhead wasn't compensated by the speedup. 1 CPU Core 2 CPU Cores 3 CPU Cores 4 CPU Cores 800 700 600 500 400 300 200 100 0 CNN with GPU without GPU Time in Seconds
  • 23. AutoEncoders Results The MNIST database of handwritten digits, has a training set of 60,000 examples, and a test set of 10,000 examples. And the main operation here is inner product 1 CPU Core 2 CPU Cores 3 CPU Cores 800 700 600 500 400 300 200 100 0 Auto Encoder with GPU without GPU Time in Seconds