Learn about GROMACS 4.6, the first multi-node, multi-GPU-enabled release, from Dr. Erik Lindahl, the project leader for this popular molecular dynamics package. GROMACS 4.6 lets you run your models up to 3X faster compared to the latest state-of-the-art parallel AVX-accelerated CPU code in GROMACS. Dr. Lindahl discusses the new features of the 4.6 release as well as future plans. You will learn how to download the latest accelerated version of GROMACS and which features are GPU-supported. Dr. Lindahl covers GROMACS performance on the latest NVIDIA Kepler hardware and explains how to run GPU-accelerated MD simulations. You are also invited to try GROMACS on a K40 with a free test drive and experience the new features and enhanced performance for yourself: www.nvidia.com/GPUTestDrive
2. GROMACS is used on a wide range of resources
• We're comfortably on the single-μs scale today
• Larger machines often mean larger systems, not necessarily longer simulations
3. Why use GPUs?
Throughput
• Sampling
• Free energy
• Cost efficiency
• Power efficiency
• Desktop simulation
• Upgrade old machines
• Low-end clusters
Performance
• Longer simulations
• Parallel GPU simulation using Infiniband
• High-end efficiency by using fewer nodes
• Reach timescales not possible with CPUs
4. Many GPU programs today
Caveat emptor: it is much easier to get a reference problem/algorithm to scale, i.e., you see much better relative scaling before introducing any optimization on the CPU side.
When comparing programs: what matters is absolute performance (ns/day), not the relative speedup!
6. Gromacs-4.5 with OpenMM
Previous version: what was the limitation?
• Gromacs ran entirely on the CPU as a fancy interface
• The actual simulation ran entirely on the GPU using OpenMM kernels
• Only a few select algorithms worked
• Multi-CPU sometimes beat GPU performance...
7. Why don't we use the CPU too?
CPU:
• 0.5-1 TFLOP
• Random memory access OK (not great)
• Great for complex latency-sensitive stuff (domain decomposition, etc.)
GPU:
• ~2 TFLOP
• Random memory access won't work
• Great for throughput
8. Gromacs-4.6 next-generation GPU implementation: Programming model
• Domain decomposition with dynamic load balancing
[Diagram: each domain maps to 1 MPI rank with N OpenMP threads on the CPU (one rank's threads also handle PME) and 1 GPU context; load balancing both between the ranks and between CPU and GPU]
(a hedged launch example follows below)
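To make the rank/thread/GPU mapping concrete, here is a minimal launch sketch. The thread-MPI and OpenMP counts and the -gpu_id string follow the GROMACS 4.6 mdrun options as I understand them; treat the exact flags and values as assumptions and check 'mdrun -h' for your build.

  # Hedged sketch: 2 domains on one node, each as 1 thread-MPI rank
  # with 6 OpenMP threads and its own GPU (ids 0 and 1)
  mdrun -ntmpi 2 -ntomp 6 -gpu_id 01 -deffnm topol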
9. Heterogeneous CPU-GPU acceleration in GROMACS-4.6
• Wallclock time for an MD step: ~0.5 ms if we want to simulate 1 μs/day
• We cannot afford to lose all previous acceleration tricks!
10. CPU trick 1: all-bond constraints
• Remove bond vibrations:
  • Δt is limited by the fastest motions: 1 fs
  • SHAKE (iterative, slow): 2 fs, but problematic in parallel (won't work)
  • Compromise: constrain h-bonds only: 1.4 fs
• GROMACS (LINCS):
  • LINear Constraint Solver
  • Approximate matrix inversion expansion
  • Fast & stable - much better than SHAKE
  • Non-iterative
  • Enables 2-3 fs timesteps
  • Parallel: P-LINCS (from Gromacs 4.0)
(a minimal .mdp sketch follows after this slide)
[Figure: the LINCS steps from t=1 to t=2: A) move without constraints, B) project out the motion along the bonds, C) correct for the rotational extension of the bond]
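For reference, turning on all-bond LINCS constraints is just a couple of .mdp settings. This is a minimal sketch using the standard GROMACS option names; consult the GROMACS manual for lincs-order/lincs-iter defaults.

  ; Hedged sketch: all-bond constraints with LINCS, enabling a 2 fs step
  constraints          = all-bonds
  constraint-algorithm = lincs
  dt                   = 0.002    ; 2 fs, made possible by constraining bonds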
11. CPU trick 2: Virtual sites
• The next-fastest motions are H-angle vibrations and rotations of CH3/NH2 groups
• Try to remove them:
  • Ideal H positions are constructed from the heavy atoms
  • CH3/NH2 groups are made rigid
  • Calculate forces, then project them back onto the heavy atoms
  • Integrate only the heavy-atom positions, reconstruct the H's
  • Enables 5 fs timesteps!
(a topology-generation sketch follows below)
[Figure: virtual-site construction geometries (e.g. 2, 3, 3fd, 3fad, 3out, 4fd) and the interactions vs. degrees of freedom they replace]
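Virtual hydrogen sites are normally generated when the topology is built. A minimal sketch, assuming the pdb2gmx -vsite option and the usual .mdp names; check 'pdb2gmx -h' before relying on it.

  # Hedged sketch: replace hydrogens with virtual sites in the topology
  pdb2gmx -f protein.pdb -vsite hydrogens -o conf.gro -p topol.top
  # Then, in the .mdp file:
  #   dt          = 0.005      ; 5 fs, enabled by virtual sites
  #   constraints = all-bonds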
12. CPU trick 3: Non-rectangular cells & decomposition
• Load balancing works for arbitrary triclinic cells
• Lysozyme: 25k atoms in a rhombic dodecahedron vs. 36k atoms in a cubic cell
• All these "tricks" now work fine with GPUs in GROMACS-4.6!
[Background figures from the GROMACS 4 paper: the domain decomposition cells and the eighth-shell zones (within cut-off radius rc) that must be communicated to cell 0; bonded interactions are distributed over the processors by finding the smallest x, y and z coordinates of the charge groups involved and assigning the interaction there]
13. From neighborlists to cluster pair lists in GROMACS-4.6
• Atoms are gridded in x,y, then sorted and binned in z, giving a full x,y,z gridding
• Organize the atoms as tiles (clusters) with all-vs-all interactions
• Build a cluster pair list instead of an atom pair list
[Diagram: cluster-pair interaction matrix built from all-vs-all tiles]
14. Tiling circles is difficult
• Need a lot of cubes to cover a sphere
• Interactions outside the cutoff should be 0.0
• Group cutoff vs. Verlet cutoff
• GROMACS-4.6 calculates a "large enough" buffer zone so no interactions are missed
• Optimize nstlist for performance - no need to worry about missing any interactions with Verlet! (a short .mdp sketch follows below)
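As a concrete example of tuning nstlist under the Verlet scheme, the snippet below is a minimal sketch. The verlet-buffer-drift name is my assumption for the 4.6-era buffer control; if your version does not have it, GROMACS still sizes the buffer automatically.

  ; Hedged sketch: with the Verlet scheme the pair-list buffer is sized
  ; automatically, so nstlist can be raised for GPU runs without
  ; missing interactions
  cutoff-scheme       = Verlet
  nstlist             = 20        ; larger values often help on GPUs
  verlet-buffer-drift = 0.005     ; assumed 4.6 option name and default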
15. Tixel algorithm work-efficiency
• 8x8x8 tixels compared to a non-performance-optimized Verlet scheme
[Chart: work-efficiency (0-0.9) of the Verlet, pruned-tixel and non-pruned-tixel schemes for rc=0.9/rl=1.0, rc=1.2/rl=1.3 and rc=1.5/rl=1.6; values range from ~0.21 to ~0.82, with larger cutoffs giving higher efficiency]
• Highly memory-efficient algorithm: can handle 20-40 million atoms with 2-3 GB of memory
• Even cheap consumer cards will get you a long way
16. PME weak scaling
[Chart: iteration time per 1000 atoms (ms/step) vs. system size per GPU (from 1,500 to ~3 million atoms), Xeon X5650 3T + one C2075 per process; curves for the 1xC2075 CUDA force kernel and the total CPU step time with 1, 2 and 4 C2075s. The complete time step includes the kernel, h2d, d2h, CPU constraints, CPU PME, CPU integration, OpenMP & MPI. Marked points: 480 μs/step at 1,500 atoms and 700 μs/step at 6,000 atoms]
17. Example performance: systems with ~24,000 atoms, 2 fs time steps, NPT
Amber (DHFR):
• CPU, 96 CPU cores
• GPU, 1xGTX680
• GPU, 4xGTX680
Gromacs (RNAse):
• CPU, 6 cores
• CPU, 2*8 cores
• 6 CPU cores + 1xK20c GPU
• 6 CPU cores + 1xGTX680 GPU
• dodec+vsites (5 fs), 6 CPU cores
• dodec+vsites (5 fs), 2*8 CPU cores
• dodec+vsites (5 fs), 6 cores + 1xK20c
• dodec+vsites (5 fs), 6 cores + 1xGTX680
[Chart: performance in ns/day (0-300) for each configuration]
19. GLIC: ion channel membrane protein
• 150,000 atoms
• Running on a simple desktop!
[Chart: ns/day (0-40) for i7 3930K (GMX4.5), i7 3930K (GMX4.6), i7 3930K + GTX680, and E5-2690 + GTX Titan]
20. Strong scaling of Reaction-Field and PME
• 1.5M-atom waterbox, RF cutoff = 0.9 nm, PME auto-tuned cutoff
[Chart, log-log: performance (ns/day) vs. number of processes/GPUs, with RF and PME curves compared against linear scaling]
• Challenge: GROMACS has very short iteration times, which places hard requirements on latency and bandwidth
• Small systems often work best using only a single GPU!
21. GROMACS 4.6 extreme scaling
• Scaling to 130 atoms/core: ADH protein, 134k atoms, PME, rc >= 0.9
[Chart, log-log: ns/day vs. number of sockets (CPU or CPU+GPU, 1-64) for XK6/X2090, XK7/K20X, XK6 CPU only, and XE6 CPU only]
23. Compiling GROMACS with CUDA
• Make sure the CUDA driver is installed
• Make sure the CUDA SDK is in /usr/local/cuda
• Use the default GROMACS distribution
• Just run 'cmake' and we will detect CUDA automatically and use it (a build sketch follows below)
• gcc-4.7 works great as a compiler
• On Macs, you want to use icc (commercial)
• Longer Mac story: Clang does not support OpenMP, which gcc does. However, the current gcc versions for Macs do not support AVX on the CPU. icc supports both!
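As a concrete starting point, a typical out-of-source build might look like the sketch below. The GMX_GPU and CUDA_TOOLKIT_ROOT_DIR options are assumptions based on the GROMACS 4.6 build system; check the installation instructions on www.gromacs.org if they differ for your version.

  # Hedged build sketch for GROMACS 4.6 with CUDA
  # (GMX_GPU / CUDA_TOOLKIT_ROOT_DIR are assumed option names)
  tar xzf gromacs-4.6.tar.gz
  cd gromacs-4.6 && mkdir build && cd build
  cmake .. -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
  make -j 8
  sudo make install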
24. Using GPUs in practice
In your mdp file:
  cutoff-scheme   = Verlet
  nstlist         = 10            ; likely 10-50
  coulombtype     = pme           ; or reaction-field
  vdw-type        = cut-off
  nstcalcenergy   = -1            ; only when writing edr
• The Verlet cutoff-scheme is more accurate
• Necessary for GPUs in GROMACS
• Use the -testverlet mdrun option to force it with old tpr files
• Slower on a single CPU, but scales well on CPUs too!
• A shift modifier is applied to both Coulomb and VdW by default on GPUs - change with coulomb-modifier / vdw-modifier
(a run sketch follows below)
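Putting the settings above into a run, the workflow below is a minimal sketch. The -nb and -testverlet options are GROMACS 4.6 mdrun flags as I recall them; confirm with 'mdrun -h'.

  # Hedged sketch: build a run input with the Verlet settings above,
  # then run with the nonbonded work offloaded to the GPU
  grompp -f gpu.mdp -c conf.gro -p topol.top -o topol.tpr
  mdrun -deffnm topol -nb gpu
  # For an existing .tpr built with the group scheme, -testverlet
  # forces the Verlet scheme without regenerating the tpr:
  mdrun -deffnm old_run -testverlet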
25. Load balancing
  rcoulomb        = 1.0
  fourierspacing  = 0.12
• If we increase/decrease the Coulomb direct-space cutoff and the reciprocal-space PME grid spacing by the same factor, we maintain accuracy...
• ...but we move work between the CPU and the GPU!
• By default, GROMACS-4.6 does this automatically at the start of each run - you will see diagnostic output
• GROMACS excels when you combine a fairly fast CPU and GPU. Currently, this means Intel CPUs.
(a worked example follows below)
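A worked example of that trade, following the rule stated above: scaling both parameters by the same factor (here, a hypothetical 1.2x) shifts direct-space work onto the GPU and shrinks the PME grid work on the CPU while keeping the accuracy roughly constant.

  ; Baseline (from the slide)
  rcoulomb        = 1.0      ; nm
  fourierspacing  = 0.12     ; nm
  ; Hypothetical re-balanced settings, both scaled by 1.2:
  ; rcoulomb        = 1.2    ; more direct-space (GPU) work
  ; fourierspacing  = 0.144  ; coarser PME grid, less CPU work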
27. Acknowledgments
• GROMACS: Berk Hess, David van der Spoel, Per Larsson, Mark Abraham
• Gromacs-GPU: Szilard Pall, Berk Hess, Rossen Apostolov
• Multi-threaded PME: Roland Schulz, Berk Hess
• Nvidia: Mark Berger, Scott LeGrand, Duncan Poole, and others!
28. Test Drive K20 GPUs!
Questions?
• Run GROMACS on a Tesla K20 GPU today
• Contact us: Devang Sachdev, NVIDIA - dsachdev@nvidia.com - @DevangSachdev
• Experience the acceleration: sign up for a FREE GPU Test Drive on remotely hosted clusters at www.nvidia.com/GPUTestDrive
• GROMACS questions: check www.gromacs.org and the gmx-users@gromacs.org mailing list
• Stream other webinars from GTC Express: http://www.gputechconf.com/page/gtc-express-webinar.html
29. Register for the Next GTC Express Webinar
Molecular Shape Searching on GPUs
Paul Hawkins, Applications Science Group Leader, OpenEye
Wednesday, May 22, 2013, 9:00 AM PDT
Register at www.gputechconf.com/gtcexpress