Brief Overview of a Parallel Nbody Code
1. Brief overview of a parallel nbody code
Implementation and analysis
Filipo Novo Mór
Graduate Program in Computer Science UFRGS
Prof. Nicollas Maillard
December 2013
2. Overview
• About the nbody problem
• The Serial Implementation
• The OpenMP Implementation
• The CUDA Implementation
• Experimental Results
• Conclusion
3. About the nbody problem
Features:
Force calculation between all particles.
Complexity O(N^2).
Total energy should remain constant.
The brute force algorithm demands huge computational power.
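As a rough illustration of the features above, the brute force calculation can be sketched as below. This is a minimal sketch, not the code from the presentation: the names `naive_accelerations` and `total_energy`, the unit constant `G`, and the `SOFTENING` term are all assumptions made for the example.

```python
import math

G = 1.0            # gravitational constant in arbitrary units (assumption)
SOFTENING = 1e-9   # small term that avoids division by zero (assumption)

def naive_accelerations(pos, mass):
    """Brute force: visit every ordered pair (i, j), so the cost is O(N^2)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + SOFTENING
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc

def total_energy(pos, vel, mass):
    """Kinetic plus potential energy; a good integrator keeps this nearly constant."""
    kinetic = sum(0.5 * m * sum(c * c for c in v) for m, v in zip(mass, vel))
    potential = 0.0
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.sqrt(sum((pos[j][k] - pos[i][k]) ** 2 for k in range(3)))
            potential -= G * mass[i] * mass[j] / r
    return kinetic + potential
```

Comparing `total_energy` before and after a number of time steps is one simple way to check that a simulation behaves as expected.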
5. The Serial Implementation
• It is still in the O(N^2) domain, but:
• Each pair is evaluated only once.
• The acceleration is still correct at the end!
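The "each pair once" idea can be sketched as follows. This is an illustrative sketch under assumptions, not the presented implementation: the function name `pairwise_accelerations`, the constants `G` and `SOFTENING`, and the returned pair counter are introduced only for this example.

```python
import math

G = 1.0            # gravitational constant in arbitrary units (assumption)
SOFTENING = 1e-9   # small term that avoids division by zero (assumption)

def pairwise_accelerations(pos, mass):
    """Visit each unordered pair once and update both bodies (Newton's third law)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):          # j > i: N(N-1)/2 pairs in total
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + SOFTENING
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                f = G * d[k] * inv_r3
                acc[i][k] += mass[j] * f   # pull of j on i
                acc[j][k] -= mass[i] * f   # equal and opposite pull of i on j
            pairs += 1
    return acc, pairs
```

The inner loop writes to both `acc[i]` and `acc[j]`, which halves the work but means two iterations of the outer loop can update the same entry.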
6. The OpenMP Implementation
• MUST be based on the “naive” version.
• We lose the “/2” (each pair evaluated only once), but we gain the “/p” (work split across p threads)!
• Note: the static schedule seems to be slightly faster than the dynamic schedule.
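The decomposition behind these points can be sketched in Python as an analogy; it is not OpenMP and not the presented code. The names `static_chunks`, `accelerations_for_rows` and `parallel_accelerations` are hypothetical, and the contiguous split is only similar in spirit to a static schedule. In the naive version each worker writes only the rows it owns, so no two workers touch the same entry.

```python
import math
from concurrent.futures import ThreadPoolExecutor

G = 1.0            # gravitational constant in arbitrary units (assumption)
SOFTENING = 1e-9   # small term that avoids division by zero (assumption)

def static_chunks(n, p):
    """Split range(n) into p contiguous blocks of nearly equal size."""
    base, extra = divmod(n, p)
    chunks, start = [], 0
    for t in range(p):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def accelerations_for_rows(pos, mass, rows):
    """Naive inner loop for the rows owned by one worker; results are private."""
    out = {}
    for i in rows:
        a = [0.0, 0.0, 0.0]
        for j in range(len(pos)):
            if j == i:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + SOFTENING
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                a[k] += G * mass[j] * d[k] * inv_r3
        out[i] = a
    return out

def parallel_accelerations(pos, mass, p):
    """Each of the p workers handles about N/p rows of the naive double loop."""
    acc = [None] * len(pos)
    with ThreadPoolExecutor(max_workers=p) as pool:
        futures = [pool.submit(accelerations_for_rows, pos, mass, rows)
                   for rows in static_chunks(len(pos), p)]
        for fut in futures:
            for i, a in fut.result().items():
                acc[i] = a
    return acc
```

Python threads are used here only to show how the rows are divided; they do not reproduce the speed-up measured with OpenMP.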
22. Analysis
C: cost of the CalculateForce function.
M: transfer cost between global and shared memories.
T: transfer cost between CPU and device memories.
Access to shared memory is around 100x faster than to global memory.
23. Experimental Results
How much would it cost?
Testing Environment:
Dell PowerEdge R610
2 x Intel Xeon Quad-Core E5520, 2.27 GHz, Hyper-Threading
8 physical cores, 16 threads.
RAM 16GB
NVIDIA Tesla S2050
Ubuntu Server 10.04 LTS
GCC 4.4.3
CUDA 5.0
Version   Cost
Naive     $0.49
Smart     $0.33
OMP       $0.08
CUDA      $0.05
Amazon EC2:
General Purpose - m1.large plan
GPU Instances - g2.2xlarge plan
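One plausible way to arrive at a per-run cost like those above is to multiply the run time by the hourly price of the instance. The sketch below assumes billing proportional to run time; the function name `run_cost` and the numbers in the usage lines are hypothetical and are not the measured run times or the actual EC2 prices.

```python
def run_cost(runtime_seconds, price_per_hour):
    """Cost of one run, assuming billing proportional to run time (assumption)."""
    return runtime_seconds / 3600.0 * price_per_hour

# Hypothetical inputs for illustration only: a 30 minute run at $0.50 per hour.
example_cost = run_cost(1800, 0.50)
print(f"${example_cost:.2f}")
```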
24. Conclusions
• PRAM is OK for the sequential and OpenMP versions.
• But for CUDA, we need a better model!
– Considering thread blocks, warps and latency.
Thanks!