2. Outline
• Introduction (slide 3)
• Facing the problems in GPGPU (slide 13)
• rCUDA functionality (slide 22)
• New rCUDA version (slide 29)
• Getting rCUDA (slide 42)
As this presentation contains a lot of information, readers can jump directly to the section that interests them most by using the slide numbers given above.
3. Improving application performance
• The complexity of current applications makes their execution times extremely long
• There is a trend to accelerate parts of their code by using GPUs
4. GPU computing: the building block
The basic building block is a node with one
or more GPUs
[Diagram: two nodes, each with a CPU, main memory, and a network interface, plus several GPUs (each with its own memory) attached through PCIe.]
5. GPU computing: programmer view
From the programming point of view, the system is:
• A set of nodes, each one with:
− one or more CPUs (several cores per CPU)
− one or more GPUs (1-4)
• An interconnection network
[Diagram: six such nodes, each with its CPU, main memory, network interface, and PCIe-attached GPUs with their own memory, joined by an interconnection network.]
6. Not all kinds of code are eligible for GPUs
• For the right kind of code, the use of GPUs brings huge benefits in terms of performance and energy
• There must be data parallelism in the code: this is the only way to benefit from the hundreds of processors inside a GPU (see the sketch after this list)
• We can find different scenarios from the point
of view of the application:
− Low level of data parallelism
− High level of data parallelism
− Moderate level of data parallelism
− Applications for multi-GPU computing
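As a minimal sketch of what data-parallel code looks like (this example is not part of the original slides): each output element depends only on the corresponding input elements, so hundreds of GPU threads can compute elements at the same time.

// Minimal data-parallel CUDA kernel: element i of the result depends
// only on element i of the inputs, so each GPU thread can compute one
// element independently.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Launched with enough threads to cover all n elements, e.g.:
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);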
7. Low and high level of data parallelism
Low level of data parallelism
Regarding GPU computing?
No GPU is needed, just proceed with the traditional HPC
strategies
High level of data parallelism
Regarding GPU computing?
Add one or more GPUs to every node in the system and
rewrite applications to use them
8. Moderate level of data parallelism
The application presents data parallelism of around 40%-80%. This is the common case
Regarding GPU computing?
The GPUs in the system are used only for some parts of the application, leaving them idle the rest of the time
9. Money leaks in current clusters
• The GPUs of a CUDA-enabled cluster may
be idle for long periods of time
• Waste of resources and energy
• The total cost of ownership (TCO) is no longer dominated by acquisition costs; the electricity bill and rack space are increasingly important contributors
10. Last scenario: multi-GPU computing
• An application can use a large number of GPUs in parallel
Regarding GPU computing?
The code running in a node can only access the GPUs in that node, but it would run faster if it could access more GPUs
11. GPU computing presents drawbacks
• Although GPUs effectively accelerate
applications, their use may bring additional
concerns
12. Outline
• Introduction (slide 3)
• Facing the problems in GPGPU (slide 13)
• rCUDA functionality (slide 22)
• New rCUDA version (slide 29)
• Getting rCUDA (slide 42)
13. Looking for an efficient solution
• A way of avoiding the low GPU utilization inefficiency is to reduce the number of GPUs and share the remaining ones among the CPU nodes in the cluster
• This would increase GPU utilization while also reducing power consumption
14. Saving costs by doing better
• Doing better by spending less money on GPUs in the initial investment, therefore reducing the TCO
• Doing better by deploying rCUDA on your new cost-effective cluster
15. rCUDA adds value
• rCUDA allows sharing GPUs among
nodes in the cluster
• rCUDA allows having fewer GPUs than nodes in the cluster
• rCUDA provides remote access
from each node to any GPU in the
system
• rCUDA reduces costs without
noticeably reducing performance
16. The main idea behind rCUDA
Add only the GPU computing nodes that give you the necessary computational power!
Make all the GPUs accessible from every node
17. rCUDA also extends CUDA’s possibilities
• rCUDA allows providing all the
GPUs in the cluster to a single
application
• Current limitations in the number
of GPUs per node are avoided
• Useful for multi-GPU computing:
now the only limit is the
programmer’s ability to
accelerate her/his application
18. rCUDA for multi-GPU computing
• All GPUs available to every node
[Diagram: five nodes, each with CPU, main memory, network interface, and two PCIe-attached GPUs with their own memory, joined by an interconnection network.]
Currently, from a given CPU it is only possible to access
the GPUs in that very same node
19. rCUDA for multi-GPU computing
• All GPUs available to every node
[Diagram: the same five nodes, now with logical connections from every CPU across the interconnection network to the GPUs in other nodes.]
rCUDA makes all GPUs accessible from every node
20. rCUDA for multi-GPU computing
• All GPUs available to every node
[Diagram: the same cluster, with a single CPU holding logical connections to many GPUs across several nodes at once.]
rCUDA makes all GPUs accessible from every node and enables a CPU to access as many GPUs as required
21. Outline
• Introduction (slide 3)
• Facing the problems in GPGPU (slide 13)
• rCUDA functionality (slide 22)
• New rCUDA version (slide 29)
• Getting rCUDA (slide 42)
22. How rCUDA works
• rCUDA is a middleware that enables seamless remote CUDA usage
• rCUDA clusters are equipped with:
− the rCUDA client at every node
− the rCUDA server only in those nodes having a GPU
• Client-server communication uses either general TCP/IP communications or, alternatively, a highly efficient low-level communications library for InfiniBand networks
23. Seamless usage
The usual way GPUs are currently used:
[Diagram: the application calls the CUDA library directly on the local node.]
24. Seamless usage
rCUDA leverages a client and a server
[Diagram: the application runs on the client side; the CUDA library sits on the server side.]
25. Seamless usage
rCUDA leverages a client and a server
Client side: Application → rCUDA library → Network interface
Server side: Network interface → rCUDA daemon → CUDA library
26. Seamless usage
rCUDA leverages a client and a server
Client side: Application → rCUDA library → Network interface
Server side: Network interface → rCUDA daemon → CUDA library
Requests flow from the client to the server; responses flow back
27. rCUDA uses a proprietary communication protocol
Example (see the sketch below):
1) Initialization
2) Memory allocation on the remote GPU
3) CPU-to-GPU memory transfer of the input data
4) Kernel execution
5) GPU-to-CPU memory transfer of the results
6) GPU memory release
7) Communication channel closing and server process finalization
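As a rough sketch (the kernel name and sizes are illustrative, and the wire protocol itself is proprietary), this is the plain CUDA host code that would trigger the request sequence above when run on top of rCUDA:

#include <cuda_runtime.h>

// Illustrative kernel, assumed to be defined elsewhere in the program.
extern __global__ void kernel(float *data, int n);

void run(const float *input, float *output, int n)
{
    float *d_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_data, bytes);          // 2) allocation on the remote GPU
    cudaMemcpy(d_data, input, bytes,
               cudaMemcpyHostToDevice);           // 3) CPU-to-GPU transfer of the input data
    kernel<<<(n + 255) / 256, 256>>>(d_data, n);  // 4) kernel execution
    cudaMemcpy(output, d_data, bytes,
               cudaMemcpyDeviceToHost);           // 5) GPU-to-CPU transfer of the results
    cudaFree(d_data);                             // 6) GPU memory release
    // 1) happens implicitly on the first CUDA call;
    // 7) happens when the application exits.
}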
28. Outline
• Introduction (slide 3)
• Facing the problems in GPGPU (slide 13)
• rCUDA functionality (slide 22)
• New rCUDA version (slide 29)
• Getting rCUDA (slide 42)
29. Features in the new rCUDA version
• CUDA 5 support
• Efficient InfiniBand support
• Support for CUDA extensions to C
• Multithread support
• Support for providing a single application
with multiple GPUs across the cluster
30. New InfiniBand support
Why InfiniBand support?
InfiniBand is the most widely used HPC network:
− low latency and high bandwidth
[Chart: interconnect family share in the TOP500 list.]
31. New InfiniBand support
Use of IB Verbs:
− all TCP/IP stack overhead is avoided
− bandwidth between the client and the remote GPU is near the peak InfiniBand network bandwidth
Use of GPUDirect:
− reduces the number of intra-node data movements
Use of pipelined transfers (sketched below):
− overlaps intra-node data movements with network transfers
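rCUDA's internal pipeline is not public, but the general technique can be sketched with standard CUDA calls (buffer size and names are illustrative): page-locked staging buffers enable asynchronous copies, so filling one buffer overlaps with the in-flight transfer of the other.

#include <string.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)   // illustrative 1 MiB pipeline chunk

void send_pipelined(const char *src, char *dst_dev, size_t total)
{
    char *pinned[2];
    cudaEvent_t done[2];
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&pinned[0], CHUNK);   // page-locked staging buffers:
    cudaMallocHost((void **)&pinned[1], CHUNK);   // required for async copies
    cudaEventCreate(&done[0]);
    cudaEventCreate(&done[1]);

    for (size_t off = 0, i = 0; off < total; off += CHUNK, i++) {
        size_t len = (total - off < CHUNK) ? total - off : CHUNK;
        char *buf = pinned[i % 2];
        if (i >= 2)
            cudaEventSynchronize(done[i % 2]);    // wait until this buffer is free
        memcpy(buf, src + off, len);              // intra-node movement of chunk i,
        cudaMemcpyAsync(dst_dev + off, buf, len,  // overlapped with the transfer of
                        cudaMemcpyHostToDevice,   // the previous chunk still in flight
                        stream);
        cudaEventRecord(done[i % 2], stream);
    }
    cudaStreamSynchronize(stream);
    cudaEventDestroy(done[0]);
    cudaEventDestroy(done[1]);
    cudaFreeHost(pinned[0]);
    cudaFreeHost(pinned[1]);
    cudaStreamDestroy(stream);
}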
32. Performance example for InfiniBand
[Chart: remote GPU bandwidth (MB/s) for synchronous transfers, to GPU and from GPU, comparing rCUDA over GigE (1 Gbps Ethernet), rCUDA over IPoIB (IP over InfiniBand), rCUDA over IB Verbs (low-level InfiniBand library), and a local GPU. The IB Verbs version shows enhanced performance thanks to an internal algorithm making use of pinned memory, exploiting close to the maximum bandwidth of both InfiniBand QDR and the NVIDIA Tesla C2050.]
33. Performance example for InfiniBand
[Chart: execution time (seconds) for a 4096 x 4096 matrix-matrix product: MKL on an Intel Xeon E5645 CPU (100%), CUBLAS 3.2 with local CUDA on an NVIDIA GeForce 9800 GTX (48%), and rCUDA over 40G InfiniBand (50%).]
1. Local GPU computation is much faster than the CPU
2. Using a remote GPU through rCUDA is only slightly slower than a local GPU
3. Therefore, employing a remote GPU is much faster than a local CPU
35. Performance example for InfiniBand
[Chart: execution time for the LAMMPS application, with the in.eam input script scaled by a factor of 5 in the three dimensions; Tesla C2050, Intel Xeon E5520, QDR InfiniBand.]
36. Support for CUDA extensions to C
Previously, rCUDA did not support the CUDA extensions to C
In order to execute a program with rCUDA, the CUDA extensions included in its code had to be "unextended" to the plain C API
This is because NVCC translates the extensions into calls to undocumented CUDA functions
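For illustration only (this is not rCUDA's actual mechanism), here is the same launch written with the CUDA extension to C and, approximately, the plain runtime calls NVCC emitted for it in the CUDA 5 era:

#include <cuda_runtime.h>

__global__ void increment(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void launch_extended(float *d_data, int n)
{
    // The CUDA extension to C: triple-angle-bracket launch syntax.
    increment<<<(n + 255) / 256, 256>>>(d_data, n);
}

void launch_plain(float *d_data, int n)
{
    // Roughly what NVCC lowered the launch above to; these runtime calls
    // were undocumented for direct use (and are deprecated today).
    dim3 grid((n + 255) / 256), block(256);
    cudaConfigureCall(grid, block);
    cudaSetupArgument(&d_data, sizeof(d_data), 0);
    cudaSetupArgument(&n, sizeof(n), sizeof(d_data));
    cudaLaunch((const char *)increment);
}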
37. Support for CUDA extensions to C
The new rCUDA version to be released will
support the CUDA extensions to C
The exact way we have achieved this goal will not be disclosed in this document. We ask for some patience …
38. Multithread Support
The new rCUDA version supports applications
with multiple threads
All the threads of the application can access the remote GPU concurrently, in the same way as if the GPU were installed in the node executing the application (see the sketch below)
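A minimal sketch of this (thread count and names are illustrative): several host threads launch work and synchronize independently, while rCUDA forwards all of their calls to the same remote GPU.

#include <pthread.h>
#include <cuda_runtime.h>

// Illustrative kernel, assumed to be defined elsewhere in the program.
extern __global__ void kernel(float *data, int n);

struct job { float *d_data; int n; };

static void *worker(void *arg)
{
    struct job *j = (struct job *)arg;
    kernel<<<(j->n + 255) / 256, 256>>>(j->d_data, j->n);
    cudaDeviceSynchronize();          // each thread waits for its own work
    return NULL;
}

void run_threads(struct job jobs[], int nthreads)   // nthreads <= 16 assumed
{
    pthread_t t[16];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, worker, &jobs[i]);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
}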
39. MultiGPU support for a single application
The new rCUDA version is able to provide a single application with all the GPUs in the cluster
Accelerating applications no longer depends on the number of GPUs that fit into a node (see the sketch below)
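A sketch of what this enables (the kernel and sizes are illustrative): the usual multi-GPU loop is unchanged, but under rCUDA cudaGetDeviceCount() can report GPUs spread across the whole cluster rather than only those in the local node.

#include <cuda_runtime.h>

// Illustrative kernel, assumed to be defined elsewhere in the program.
extern __global__ void kernel(float *data, int n);

void use_all_gpus(int n)
{
    int count;
    cudaGetDeviceCount(&count);              // every GPU rCUDA exposes
    for (int dev = 0; dev < count; dev++) {
        float *d_data;
        cudaSetDevice(dev);                  // may select a remote GPU
        cudaMalloc((void **)&d_data, n * sizeof(float));
        kernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaDeviceSynchronize();
        cudaFree(d_data);
    }
}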
40. MultiGPU multithreaded support
The new rCUDA features can be combined, so that each thread of a single application can access as many GPUs as it requires
41. Outline
• Introduction (slide 3)
• Facing the problems in GPGPU (slide 13)
• rCUDA functionality (slide 22)
• New rCUDA version (slide 29)
• Getting rCUDA (slide 42)
42. Getting rCUDA
Full InfiniBand version freely available:
− Enhanced client-server data transfers
− High-performance InfiniBand
communications library
− TCP/IP-based communications also
included for non-InfiniBand networks
http://www.rcuda.net