Learn about GROMACS 4.6, the first multi-node, multi-GPU-enabled release, from Dr. Erik Lindahl, the project leader for this popular molecular dynamics package. GROMACS 4.6 lets you run your models up to 3X faster compared to the latest state-of-the-art parallel AVX-accelerated CPU code in GROMACS. Dr. Lindahl discusses the new features of the 4.6 release as well as future plans. You will learn how to download the latest accelerated version of GROMACS and which features are GPU-supported. Dr. Lindahl covers GROMACS performance on the latest NVIDIA Kepler hardware and explains how to run GPU-accelerated MD simulations. You are also invited to try GROMACS on a K40 with a free test drive and experience the new features and enhanced performance for yourself: www.nvidia.com/GPUTestDrive
2. GROMACS is used on a wide range of resources
• We're comfortably on the single-μs scale today
• Larger machines often mean larger systems, not necessarily longer simulations
3. Why use GPUs?
Throughput
• Sampling
• Free energy
• Cost efficiency
• Power efficiency
• Desktop simulation
• Upgrade old machines
• Low-end clusters
Performance
• Longer simulations
• Parallel GPU simulation using Infiniband
• High-end efficiency by using fewer nodes
• Reach timescales not possible with CPUs
4. Many GPU programs today
Caveat emptor: it is much easier to get a reference problem/algorithm to scale, i.e., you see much better relative scaling before introducing any optimization on the CPU side.
When comparing programs: what matters is absolute performance (ns/day), not the relative speedup!
6. Gromacs-4.5 with OpenMM
Previous version: what was the limitation?
• Gromacs ran entirely on the CPU as a fancy interface
• The actual simulation ran entirely on the GPU using OpenMM kernels
• Only a few select algorithms worked
• Multi-CPU sometimes beat GPU performance...
7. Why don't we use the CPU too?
CPU:
• 0.5-1 TFLOP
• Random memory access OK (not great)
• Great for complex latency-sensitive stuff (domain decomposition, etc.)
GPU:
• ~2 TFLOP
• Random memory access won't work
• Great for throughput
8. Gromacs-4.6 next-generation GPU implementation: Programming model
• Domain decomposition with dynamic load balancing
[Diagram: each domain maps to 1 MPI rank with N OpenMP threads on the CPU (one rank's threads also handle PME) and 1 GPU context; load balancing both between the ranks and between CPU and GPU]
(a hedged launch example follows below)
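To make the rank/thread/GPU mapping concrete, here is a minimal launch sketch. The thread-MPI and OpenMP counts and the -gpu_id string follow the GROMACS 4.6 mdrun options as I understand them; treat the exact flags and values as assumptions and check 'mdrun -h' for your build.

  # Hedged sketch: 2 domains on one node, each as 1 thread-MPI rank
  # with 6 OpenMP threads and its own GPU (ids 0 and 1)
  mdrun -ntmpi 2 -ntomp 6 -gpu_id 01 -deffnm topol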
9. Heterogeneous CPU-GPU acceleration in GROMACS-4.6
• Wallclock time for an MD step: ~0.5 ms if we want to simulate 1 μs/day
• We cannot afford to lose all previous acceleration tricks!
10. CPU trick 1: all-bond constraints
• Remove bond vibrations:
  • Δt is limited by the fastest motions: 1 fs
  • SHAKE (iterative, slow): 2 fs, but problematic in parallel (won't work)
  • Compromise: constrain h-bonds only: 1.4 fs
• GROMACS (LINCS):
  • LINear Constraint Solver
  • Approximate matrix inversion expansion
  • Fast & stable - much better than SHAKE
  • Non-iterative
  • Enables 2-3 fs timesteps
  • Parallel: P-LINCS (from Gromacs 4.0)
(a minimal .mdp sketch follows after this slide)
[Figure: the LINCS steps from t=1 to t=2: A) move without constraints, B) project out the motion along the bonds, C) correct for the rotational extension of the bond]
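For reference, turning on all-bond LINCS constraints is just a couple of .mdp settings. This is a minimal sketch using the standard GROMACS option names; consult the GROMACS manual for lincs-order/lincs-iter defaults.

  ; Hedged sketch: all-bond constraints with LINCS, enabling a 2 fs step
  constraints          = all-bonds
  constraint-algorithm = lincs
  dt                   = 0.002    ; 2 fs, made possible by constraining bonds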
11. CPU trick 2: Virtual sites
• The next-fastest motions are H-angle vibrations and rotations of CH3/NH2 groups
• Try to remove them:
  • Ideal H positions are constructed from the heavy atoms
  • CH3/NH2 groups are made rigid
  • Calculate forces, then project them back onto the heavy atoms
  • Integrate only the heavy-atom positions, reconstruct the H's
  • Enables 5 fs timesteps!
(a topology-generation sketch follows below)
[Figure: virtual-site construction geometries (e.g. 2, 3, 3fd, 3fad, 3out, 4fd) and the interactions vs. degrees of freedom they replace]
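Virtual hydrogen sites are normally generated when the topology is built. A minimal sketch, assuming the pdb2gmx -vsite option and the usual .mdp names; check 'pdb2gmx -h' before relying on it.

  # Hedged sketch: replace hydrogens with virtual sites in the topology
  pdb2gmx -f protein.pdb -vsite hydrogens -o conf.gro -p topol.top
  # Then, in the .mdp file:
  #   dt          = 0.005      ; 5 fs, enabled by virtual sites
  #   constraints = all-bonds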
12. CPU trick 3: Non-rectangular cells & decomposition
• Load balancing works for arbitrary triclinic cells
• Lysozyme: 25k atoms in a rhombic dodecahedron vs. 36k atoms in a cubic cell
• All these "tricks" now work fine with GPUs in GROMACS-4.6!
[Background figures from the GROMACS 4 paper: the domain decomposition cells and the eighth-shell zones (within cut-off radius rc) that must be communicated to cell 0; bonded interactions are distributed over the processors by finding the smallest x, y and z coordinates of the charge groups involved and assigning the interaction there]
13. From neighborlists to cluster pair lists in GROMACS-4.6
• Atoms are gridded in x,y, then sorted and binned in z, giving a full x,y,z gridding
• Organize the atoms as tiles (clusters) with all-vs-all interactions
• Build a cluster pair list instead of an atom pair list
[Diagram: cluster-pair interaction matrix built from all-vs-all tiles]
14. Tiling circles is difficult
• Need a lot of cubes to cover a sphere
• Interactions outside the cutoff should be 0.0
• Group cutoff vs. Verlet cutoff
• GROMACS-4.6 calculates a "large enough" buffer zone so no interactions are missed
• Optimize nstlist for performance - no need to worry about missing any interactions with Verlet! (a short .mdp sketch follows below)
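As a concrete example of tuning nstlist under the Verlet scheme, the snippet below is a minimal sketch. The verlet-buffer-drift name is my assumption for the 4.6-era buffer control; if your version does not have it, GROMACS still sizes the buffer automatically.

  ; Hedged sketch: with the Verlet scheme the pair-list buffer is sized
  ; automatically, so nstlist can be raised for GPU runs without
  ; missing interactions
  cutoff-scheme       = Verlet
  nstlist             = 20        ; larger values often help on GPUs
  verlet-buffer-drift = 0.005     ; assumed 4.6 option name and default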
15. Tixel algorithm work-efficiency
• 8x8x8 tixels compared to a non-performance-optimized Verlet scheme
[Chart: work-efficiency (0-0.9) of the Verlet, pruned-tixel and non-pruned-tixel schemes for rc=0.9/rl=1.0, rc=1.2/rl=1.3 and rc=1.5/rl=1.6; values range from ~0.21 to ~0.82, with larger cutoffs giving higher efficiency]
• Highly memory-efficient algorithm: can handle 20-40 million atoms with 2-3 GB of memory
• Even cheap consumer cards will get you a long way
16. PME weak scaling
[Chart: iteration time per 1000 atoms (ms/step) vs. system size per GPU (from 1,500 to ~3 million atoms), Xeon X5650 3T + one C2075 per process; curves for the 1xC2075 CUDA force kernel and the total CPU step time with 1, 2 and 4 C2075s. The complete time step includes the kernel, h2d, d2h, CPU constraints, CPU PME, CPU integration, OpenMP & MPI. Marked points: 480 μs/step at 1,500 atoms and 700 μs/step at 6,000 atoms]
17. Example performance: systems with ~24,000 atoms, 2 fs time steps, NPT
Amber (DHFR):
• CPU, 96 CPU cores
• GPU, 1xGTX680
• GPU, 4xGTX680
Gromacs (RNAse):
• CPU, 6 cores
• CPU, 2*8 cores
• 6 CPU cores + 1xK20c GPU
• 6 CPU cores + 1xGTX680 GPU
• dodec+vsites (5 fs), 6 CPU cores
• dodec+vsites (5 fs), 2*8 CPU cores
• dodec+vsites (5 fs), 6 cores + 1xK20c
• dodec+vsites (5 fs), 6 cores + 1xGTX680
[Chart: performance in ns/day (0-300) for each configuration]
19. GLIC: ion channel membrane protein
• 150,000 atoms
• Running on a simple desktop!
[Chart: ns/day (0-40) for i7 3930K (GMX4.5), i7 3930K (GMX4.6), i7 3930K + GTX680, and E5-2690 + GTX Titan]
20. Strong scaling of Reaction-Field and PME
• 1.5M-atom waterbox, RF cutoff = 0.9 nm, PME auto-tuned cutoff
[Chart, log-log: performance (ns/day) vs. number of processes/GPUs, with RF and PME curves compared against linear scaling]
• Challenge: GROMACS has very short iteration times, which places hard requirements on latency and bandwidth
• Small systems often work best using only a single GPU!
21. GROMACS 4.6 extreme scaling
• Scaling to 130 atoms/core: ADH protein, 134k atoms, PME, rc >= 0.9
[Chart, log-log: ns/day vs. number of sockets (CPU or CPU+GPU, 1-64) for XK6/X2090, XK7/K20X, XK6 CPU only, and XE6 CPU only]
23. Compiling GROMACS with CUDA
• Make sure the CUDA driver is installed
• Make sure the CUDA SDK is in /usr/local/cuda
• Use the default GROMACS distribution
• Just run 'cmake' and we will detect CUDA automatically and use it (a build sketch follows below)
• gcc-4.7 works great as a compiler
• On Macs, you want to use icc (commercial)
• Longer Mac story: Clang does not support OpenMP, which gcc does. However, the current gcc versions for Macs do not support AVX on the CPU. icc supports both!
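As a concrete starting point, a typical out-of-source build might look like the sketch below. The GMX_GPU and CUDA_TOOLKIT_ROOT_DIR options are assumptions based on the GROMACS 4.6 build system; check the installation instructions on www.gromacs.org if they differ for your version.

  # Hedged build sketch for GROMACS 4.6 with CUDA
  # (GMX_GPU / CUDA_TOOLKIT_ROOT_DIR are assumed option names)
  tar xzf gromacs-4.6.tar.gz
  cd gromacs-4.6 && mkdir build && cd build
  cmake .. -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
  make -j 8
  sudo make install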
24. Using GPUs in practice
In your mdp file:
  cutoff-scheme   = Verlet
  nstlist         = 10            ; likely 10-50
  coulombtype     = pme           ; or reaction-field
  vdw-type        = cut-off
  nstcalcenergy   = -1            ; only when writing edr
• The Verlet cutoff-scheme is more accurate
• Necessary for GPUs in GROMACS
• Use the -testverlet mdrun option to force it with old tpr files
• Slower on a single CPU, but scales well on CPUs too!
• A shift modifier is applied to both Coulomb and VdW by default on GPUs - change with coulomb-modifier / vdw-modifier
(a run sketch follows below)
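Putting the settings above into a run, the workflow below is a minimal sketch. The -nb and -testverlet options are GROMACS 4.6 mdrun flags as I recall them; confirm with 'mdrun -h'.

  # Hedged sketch: build a run input with the Verlet settings above,
  # then run with the nonbonded work offloaded to the GPU
  grompp -f gpu.mdp -c conf.gro -p topol.top -o topol.tpr
  mdrun -deffnm topol -nb gpu
  # For an existing .tpr built with the group scheme, -testverlet
  # forces the Verlet scheme without regenerating the tpr:
  mdrun -deffnm old_run -testverlet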
25. Load balancing
  rcoulomb        = 1.0
  fourierspacing  = 0.12
• If we increase/decrease the Coulomb direct-space cutoff and the reciprocal-space PME grid spacing by the same factor, we maintain accuracy...
• ...but we move work between the CPU and the GPU!
• By default, GROMACS-4.6 does this automatically at the start of each run - you will see diagnostic output
• GROMACS excels when you combine a fairly fast CPU and GPU. Currently, this means Intel CPUs.
(a worked example follows below)
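A worked example of that trade, following the rule stated above: scaling both parameters by the same factor (here, a hypothetical 1.2x) shifts direct-space work onto the GPU and shrinks the PME grid work on the CPU while keeping the accuracy roughly constant.

  ; Baseline (from the slide)
  rcoulomb        = 1.0      ; nm
  fourierspacing  = 0.12     ; nm
  ; Hypothetical re-balanced settings, both scaled by 1.2:
  ; rcoulomb        = 1.2    ; more direct-space (GPU) work
  ; fourierspacing  = 0.144  ; coarser PME grid, less CPU work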
27. Acknowledgments
• GROMACS: Berk Hess, David van der Spoel, Per Larsson, Mark Abraham
• Gromacs-GPU: Szilard Pall, Berk Hess, Rossen Apostolov
• Multi-threaded PME: Roland Schulz, Berk Hess
• Nvidia: Mark Berger, Scott LeGrand, Duncan Poole, and others!
28. Test Drive K20 GPUs!
Questions?
• Run GROMACS on a Tesla K20 GPU today
• Contact us: Devang Sachdev, NVIDIA - dsachdev@nvidia.com - @DevangSachdev
• Experience the acceleration: sign up for a FREE GPU Test Drive on remotely hosted clusters at www.nvidia.com/GPUTestDrive
• GROMACS questions: check www.gromacs.org and the gmx-users@gromacs.org mailing list
• Stream other webinars from GTC Express: http://www.gputechconf.com/page/gtc-express-webinar.html
29. Register for the Next GTC Express Webinar
Molecular Shape Searching on GPUs
Paul Hawkins, Applications Science Group Leader, OpenEye
Wednesday, May 22, 2013, 9:00 AM PDT
Register at www.gputechconf.com/gtcexpress