Thesis slides

Alma Mater Studiorum - University of Bologna
Master's Degree in Biomedical Engineering

Parallelization of the WHAM Algorithm with NVIDIA CUDA
Supervisor: Prof. Stefano Severi
Co-Supervisor: Ing. Simone Furini

NVIDIA Research

Presented by Nicolò Savioli
Academic year 2012/2013

Free energy:

The aim of this thesis is to port the WHAM algorithm, originally implemented for the CPU, to GPUs. WHAM is an algorithm for estimating free-energy profiles from Molecular Dynamics simulations.
Free-energy estimates can be used to quantify the affinity between molecules (pharmacological research).

Equations of motion used in Molecular Dynamics:

$F_i = m_i a_i, \quad i = 1, \dots, N$
$F_i = -\nabla_i V(r_1, \dots, r_N)$

The difference in free energy between two configurations, 0 and 1, can be expressed as:

$\Delta A = A_1 - A_0 = -k_B T \log(P_1 / P_0)$
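As a small illustration of the free-energy difference between two configurations, the following C++ sketch evaluates it from two probabilities. The function name and the `kT` parameter (k_B T in the caller's energy units) are hypothetical and are not taken from the thesis code.

```cpp
#include <cassert>
#include <cmath>

// Free-energy difference between configurations 1 and 0:
//   dA = A1 - A0 = -kT * ln(P1 / P0)
// kT is k_B * T expressed in the caller's energy units (assumed parameter).
double free_energy_difference(double p1, double p0, double kT) {
    assert(p1 > 0.0 && p0 > 0.0 && kT > 0.0);
    return -kT * std::log(p1 / p0);
}
```

With this sign convention, the more probable configuration has the lower free energy, so the result is negative when P1 > P0.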

Umbrella Sampling
(Torrie and Valleau, 1977)
Molecular Dynamics trajectories are limited in time, so the system can remain trapped in local energy minima.
A biasing potential can be used to force the system to explore new configurations.
In Umbrella Sampling, several simulations with different biasing potentials are run to explore the configuration space.
[Figure: an ion inside an ion channel, with coordinate $\xi(r^{3N})$]

Biasing potential of simulation i:
$W_i(\xi) = \frac{k}{2} (\xi - \xi_i^0)^2$

Biased Hamiltonian = unbiased Hamiltonian + biasing potential:
$H_i(\Gamma) = H_0 + W_i(\xi)$
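The harmonic biasing potential can be written as a one-line function. This is only a sketch: the names `harmonic_bias`, `center`, and `k` are illustrative and do not come from the thesis code.

```cpp
#include <cassert>

// Harmonic umbrella potential W_i(xi) = (k / 2) * (xi - center)^2,
// where `center` is the reference position of window i and k > 0
// is the force constant (both names are illustrative).
double harmonic_bias(double xi, double center, double k) {
    assert(k > 0.0);
    const double d = xi - center;
    return 0.5 * k * d * d;
}
```

Each umbrella window uses its own `center`, so evaluating this function for every window and every histogram bin gives the table of bias values that the WHAM equations need.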

Weighted Histogram Analysis Method
(WHAM)
Our aim is to calculate the properties of the original (unbiased) system using the trajectories of the biased simulations.
In the WHAM algorithm, the probability of the unbiased system is calculated as a linear combination of R estimates obtained from R independent trajectories.
Minimizing the variance of the unbiased probability gives the following set of equations:

$P^u(\xi_h) = \sum_{i=1}^{R} \frac{n_i / 2\tau_i(\xi_h)}{\sum_{j=1}^{R} (n_j / 2\tau_j(\xi_h)) \, e^{-\beta (W_j(\xi_h) - f_j)}} \, P_i^b(\xi_h)$

$f_i = -\frac{1}{\beta} \log\left( \sum_h P^u(\xi_h) \, e^{-\beta W_i(\xi_h)} \right)$

Slide annotations: number of samples inside bin h; integrated autocorrelation time ($\tau$).

The algorithm proceeds iteratively:
a) Start with an arbitrary set of $f_i$.
b) Use the first equation to calculate $P^u(\xi_h)$.
c) Use the second equation to update $f_i$.
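Steps a) to c) can be sketched on the CPU as follows. This is a minimal reference sketch, not the thesis implementation: it assumes the autocorrelation time is the same for every window and bin, so the 2τ factors cancel, and all names (`WhamResult`, `wham_iterate`, `Pb`, `W`, `n`) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct WhamResult {
    std::vector<double> P;  // unbiased probability per bin, P^u(xi_h)
    std::vector<double> f;  // free-energy constant per window, f_i
};

// Pb[i][h]: biased histogram of window i; W[i][h]: bias of window i in bin h;
// n[i]: number of samples of window i. Assumes a constant autocorrelation time.
WhamResult wham_iterate(const std::vector<std::vector<double>>& Pb,
                        const std::vector<std::vector<double>>& W,
                        const std::vector<double>& n,
                        double beta, int iterations) {
    const std::size_t R = Pb.size();
    assert(R > 0 && W.size() == R && n.size() == R && beta > 0.0);
    const std::size_t H = Pb[0].size();
    // a) start from an arbitrary set of f_i (here all zeros)
    WhamResult s{std::vector<double>(H, 0.0), std::vector<double>(R, 0.0)};
    for (int it = 0; it < iterations; ++it) {
        // b) first equation: update P^u(xi_h) for every bin
        for (std::size_t h = 0; h < H; ++h) {
            double num = 0.0, den = 0.0;
            for (std::size_t i = 0; i < R; ++i) {
                num += n[i] * Pb[i][h];
                den += n[i] * std::exp(-beta * (W[i][h] - s.f[i]));
            }
            s.P[h] = num / den;
        }
        // c) second equation: update f_i for every window
        for (std::size_t i = 0; i < R; ++i) {
            double z = 0.0;
            for (std::size_t h = 0; h < H; ++h) {
                z += s.P[h] * std::exp(-beta * W[i][h]);
            }
            s.f[i] = -std::log(z) / beta;
        }
    }
    return s;
}
```

With a single window and zero bias, the sketch returns the input histogram unchanged and f = 0, which is a convenient sanity check. The loops over bins in step b) and over windows in step c) are independent of each other, which is what makes the method a natural candidate for a GPU.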

Why GPU?
In recent years, new parallel architectures have increased the available computing power, making numerical simulations faster and more efficient.
One strategy for parallelizing mathematical models is GPGPU (General-Purpose computing on Graphics Processing Units).
GPUs were originally developed for image processing and are now also used in scientific simulations.
The computational capability of these architectures has been growing much faster than that of CPUs, and since 2007 NVIDIA has made it possible to program GPUs through a dedicated programming model called CUDA.


GPU Architecture:
NVIDIA GPUs follow the SIMD (Single Instruction, Multiple Data) model: a single control unit issues one instruction at a time to several ALUs that work synchronously.

A GPU consists of a number of Streaming Multiprocessors (SM).

The GPU is connected to the host through a PCI-Express bus.

8 or 16 Stream Processors (SP): floating-point and integer logic units, registers, execution pipelines, caches.

Texture Memory: used to map a 2D texture onto a polygonal model.

Global Memory: from 256 MB to 6 GB, with a bandwidth of 150 GB/s.

Shared Memory (32 KB): small but fast!

Example Code

#include <stdlib.h>  // malloc

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    // i) Global index of this thread: block offset plus index within the block
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // Threads beyond the end of the vectors do nothing
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);
    // a) Allocate input vectors h_A and h_B and output vector h_C in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);
    // Initialize input vectors
    ...
    // b) Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);
    // c) Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // d) Threads are grouped into blocks, which in turn are grouped into a grid:
    //    choose the number of threads per block and the number of blocks per grid
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    // e) Invoke kernel
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // f) Copy result from device memory to host memory;
    //    h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // g) Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    // h) Free host memory
    ...
}
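The index arithmetic in steps d) and i) can be checked on the CPU without a GPU. The sketch below is a plain C++ emulation written for illustration; `blocks_for` and `vec_add_emulated` are hypothetical helper names, and the two nested loops stand in for the blocks and threads that CUDA would run in parallel.

```cpp
#include <cassert>
#include <vector>

// Ceiling division: smallest number of blocks whose threads cover n elements.
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Sequential emulation of the VecAdd launch: the outer loop plays the role of
// blockIdx.x, the inner loop the role of threadIdx.x.
std::vector<float> vec_add_emulated(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    int threads_per_block) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    std::vector<float> c(a.size(), 0.0f);
    const int blocks = blocks_for(n, threads_per_block);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            const int i = threads_per_block * block + thread;  // global index
            if (i < n) {  // same bounds guard as in the kernel
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

The bounds guard matters whenever the vector length is not a multiple of the block size: the last block then contains threads whose global index falls past the end of the data.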

CUDA WHAM Considerations:
The code consists of 11 files invoked as external functions and a main file that initializes the variables and executes the iterative algorithm.
The C++ function clock() was used to time the code.
The following optimizations were made:
Constant Memory was used to store the most frequently used variables.

To optimize the summations we used a CUDA technique called sum reduction. The threads of a block are synchronized with __syncthreads() and exchange partial sums through Shared Memory until the block produces a single result.
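The access pattern of a block-level sum reduction can be illustrated with a sequential C++ emulation. This is a sketch of the general technique, not the thesis kernel: the vector `shared` stands in for the block's Shared Memory buffer, the inner loop stands in for the threads of the block, and the function name is hypothetical. It assumes the block size is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a tree-shaped sum reduction inside one block.
double block_sum_reduction(std::vector<double> shared) {
    const std::size_t n = shared.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two block size assumed
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // On the GPU, every value of tid below is a separate thread.
        for (std::size_t tid = 0; tid < stride; ++tid) {
            shared[tid] += shared[tid + stride];
        }
        // On the GPU, __syncthreads() would be called here so that all
        // partial sums of this step are visible before the next step.
    }
    return shared[0];  // the single result of the block
}
```

Each step halves the number of active threads, so a block of n threads needs log2(n) steps instead of the n - 1 additions of a sequential sum.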


Organization of the code:
//invocation of the external CUDA function for Calculating Bias
// W_i(\xi) = \frac{k}{2} (\xi - \xi_i^0)^2
Bias(HIST.numhist, HIST.numwin, HIST.numdim, dev_numhist, dev_numdim, dev_histmin,
     dev_center, dev_harmrest, dev_delta, dev_step, dev_numbin, dev_U, dev_numwham);

while ((it < numit) && (!converged)) {
    //invocation of the external CUDA function for Calculating P (New probability)
    // P^u(\xi_h) = \sum_{i=1}^{R} \frac{n_i / 2\tau_i(\xi_h)}{\sum_{j=1}^{R} (n_j / 2\tau_j(\xi_h)) e^{-\beta (W_j(\xi_h) - f_j)}} P_i^b(\xi_h)
    NewProbabilities(cpu_numhist[0], cpu_numwin[0], dev_numhist, dev_numwin,
                     dev_numbinwin, dev_g, dev_numwham, dev_U, dev_F, dev_denwham, dev_Punnorm_result);
    //invocation of the external CUDA function for Calculating new Sum
    summationP(cpu_numhist[0], cpu_numwin[0],
               dev_numhist, dev_numwin, dev_U, dev_UU, dev_numwham);
    NewSum(dev_numhist, cpu_numwin[0], dev_sumP, dev_UU, dev_Punnorm_result,
           dev_numwham);
    //invocation of the external CUDA function for Calculating new constant F
    // f_i = -\frac{1}{\beta} \log( \sum_h P^u(\xi_h) e^{-\beta W_i(\xi_h)} )
    NewConstants(cpu_numhist[0], cpu_numwin[0], dev_U, dev_Punnorm_result,
                 dev_sumP, dev_F, dev_numwham);
    //invocation of the external CUDA function for Calculating Normalization Constant
    // NF = \sum_h P^u(\xi_h)
    NormFactor(cpu_numhist[0], dev_Punnorm_result,
               sum_normfactor_for_normprob_and_normcoef, dev_numwham);
    //invocation of the external CUDA function for Normalization of P
    // P^u(\xi_h) = P^u(\xi_h) / NF
    NormProbabilities(cpu_numhist[0], dev_sum_normfactor_for_normprob_and_normcoef,
                      Punnorm_result, dev_P, dev_numwham);
    //invocation of the external CUDA function for Normalization of F
    // f_i = f_i + \log(NF)
    NormCoefficient(cpu_numwin[0], dev_sum_normfactor_for_normprob_and_normcoef,
                    dev_F, dev_sumP);
    //invocation of the external CUDA function for Convergence of the Math Model
    // Conv = (P^u_n[i] - P^u_o[i])^2, with n = new and o = old probabilities
    CheckConvergence(cpu_numhist[0], dev_P, dev_P_old, HIST.numgood,
                     rmsd_result, dev_numwham);
    //invocation of the external CUDA function for Calculating Free Energy
    // A(\xi) = -k_B T \log(P(\xi))
    ComputeEnergy(cpu_numhist[0], dev_P, dev_kT, dev_A_result, dev_P_old, dev_denwham);
    cudaMemcpy(cpu_rmsd_result, dev_rmsd_result, sizeof(float), cudaMemcpyDeviceToHost);
    if (cpu_rmsd_result[0] < tol)
        converged = true; // Has it converged?
    it++;
}
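The normalization, convergence, and free-energy steps of the loop can be sketched on the CPU as below. These helpers are illustrative only and are not the thesis kernels: the names are hypothetical, and summing the squared change over all bins in the convergence measure is an assumption, since the slide shows only the per-bin term.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// NF = sum_h P(h); divides P by NF in place and returns NF.
double normalize(std::vector<double>& P) {
    double nf = 0.0;
    for (double p : P) nf += p;
    assert(nf > 0.0);
    for (double& p : P) p /= nf;
    return nf;
}

// Squared change between two successive probability estimates,
// summed over bins (the sum over bins is an assumption).
double squared_change(const std::vector<double>& P_new,
                      const std::vector<double>& P_old) {
    assert(P_new.size() == P_old.size());
    double conv = 0.0;
    for (std::size_t h = 0; h < P_new.size(); ++h) {
        const double d = P_new[h] - P_old[h];
        conv += d * d;
    }
    return conv;
}

// A(h) = -kT * log(P(h)); a bin with P(h) == 0 yields +infinity.
std::vector<double> free_energy(const std::vector<double>& P, double kT) {
    std::vector<double> A(P.size());
    for (std::size_t h = 0; h < P.size(); ++h) {
        A[h] = -kT * std::log(P[h]);
    }
    return A;
}
```

In a loop like the one above, the value returned by `squared_change` would be compared against the tolerance to decide whether to stop iterating.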

Architectures used:
GPU WHAM was tested on different GPU architectures and compared with the corresponding CPU WHAM.
GT 9500 with Compute Capability of 1.1 (32 CUDA cores)
GT 320M with Compute Capability of 1.0 (24 CUDA cores)
Athlon X2 64 Dual Core
Intel i5 3400 Quad Core


Analysis of Convergence

[Figure: convergence plots for the GT 9500 (32 CUDA cores) and the GT 320M (24 CUDA cores); axes: kJ/mol and Time [s]]

Both GPUs reach the same point of convergence.

Performance:
Performance almost doubles from compute capability 1.0 to compute capability 1.1.

[Figure: Time [s] versus Number of Iterations for the GT 9500 (32 CUDA cores) and the GT 320M (24 CUDA cores)]

More power!

Ratio with variable grid:

[Figure: GPU/CPU Time [s] ratio versus grid size ("Number of Dim Grid")]

The ratio stays constant as the grid size increases: there are no memory traffic problems.

Conclusions:

For the first time, the WHAM algorithm has been implemented on a GPU.
The execution speed of GPU-WHAM increases with the speed of the graphics card used.
The GPU/CPU speed ratio stays constant as the grid size changes.
GPU-WHAM can run in parallel with CPU calculations, increasing the overall speed of execution.


Thank you for your attention!

