In this article we compare the results obtained with a GPGPU implementation of the Finite Volume Method on structured meshes against experimental results, and also against a Finite Element code with a boundary-fitted strategy. The example is a fully submerged spherical buoy in a cubic water container. The container undergoes harmonic linear motion imposed by a shake table. The experiment is recorded with a high-speed camera and the displacement of the buoy is obtained from the video with a Motion Capture (MoCap) algorithm. The amplitude and phase of the resulting motion allow the added mass and drag of the sphere to be determined indirectly.
1. Advances in NS solver on GPU by M. Storti et al.
Advances in the Solution of NS Eqs. in GPGPU
Hardware. Modelling Fluid Structure Interaction
for a Submerged Spherical Buoy
by Mario Storti(a,b), Santiago Costarelli(a,b), Luciano Garelli(a),
Marcela Cruchaga(c), Ronald Ausensi(c), S. Idelsohn(a,b,d,e)
(a) Centro de Investigación de Métodos Computacionales, CIMEC (CONICET-UNL), Santa Fe, Argentina, mario.storti@gmail.com, www.cimec.org.ar/mstorti
(b) Facultad de Ingeniería y Ciencias Hídricas, UN Litoral, Santa Fe, Argentina, fich.unl.edu.ar
(c) Depto. Ing. Mecánica, Univ. Santiago de Chile
(d) International Center for Numerical Methods in Engineering (CIMNE), Technical University of Catalonia (UPC), www.cimne.com
(e) Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain, www.icrea.cat
CIMEC-CONICET-UNL 1
((version cdcomp-enief2014-start-45-g8fb7450 Thu Oct 23 14:45:29 2014 -0300) (date Fri Oct 24 19:37:34 2014 -0300))
Scientific computing on GPUs
. Graphics Processing Units (GPUs) are specialized hardware designed to offload computation from the CPU in graphics-intensive applications.
. They have many cores (thread processors); currently the Tesla K40 (Kepler GK110) has 2880 cores at 745 MHz (built-in boost to 810/875 MHz).
. The raw computing power is on the order of Teraflops (4.3 Tflops in SP and 1.43 Tflops in DP).
. Memory bandwidth (GDDR5): 288 GB/s. Memory size: 12 GB.
. Cost: USD 5,000. Low-end version, the GeForce GTX Titan: USD 1,000.
Scientific computing on GPUs (cont.)
. The difference between GPU architecture and standard multicore processors is that GPUs have many more computing units (ALUs (Arithmetic-Logic Units) and SFUs (Special Function Units)), but few control units.
. The programming model is SIMD (Single Instruction Multiple Data).
. GPUs compete with many-core processors (e.g. Intel's Xeon Phi, Knights Corner, 60 cores).
. Under SIMD, threads in a warp that take different sides of a branch are serialized, so conditionals like the following are expensive:
if (COND) {
BODY-TRUE;
} else {
BODY-FALSE;
}
Xeon Phi
In December 2012 Intel launched the Xeon Phi coprocessor cards: 3100 and 5110P (USD 2,000 to 2,600). They have 60 cores built with 22 nm technology (clock speed approx. 1 GHz). A "supercomputer on a card" (SOC).
The current limitation (with 22 nm technology) is that about 5e9 transistors can be put on a single chip. Today's Xeon processors typically have 2.5e9 transistors.
The Xeon Phi has 60 cores, each roughly equivalent to the original Pentium processor (40e6 transistors).
Xeon Phi (cont.)
The Xeon Phi is almost a standalone computer. It fits in a PCI Express x16 slot and runs its own basic Linux system. You can SSH to the card and run x86-64 code.
Another workflow is to run the code on the host and send compute-intensive tasks to the card (e.g. solving a linear system).
In January 2013 the Texas Advanced Computing Center (TACC) added Xeon Phis to its Stampede supercomputer. The main CPUs are Xeon E5-2680; 128 nodes have Nvidia Kepler K20 GPUs. Estimated performance: 7.6 Pflops.
Tianhe-2 (China), the current fastest supercomputer (33.86 Pflops), also includes Xeon Phi coprocessors.
It is part of Intel's Many Integrated Core (MIC) architecture. Previous codenames for the project: Larrabee, Knights Ferry, Knights Corner.
We want to use GPUs... for what?
. Aerodynamics of a racing car
. Aerodynamics of an artillery projectile
. Flow pattern in a harbour
. Dust particles in a coke processing facility
. Thermo-fluid coupling
. Aerodynamics of a falling object
. Flutter at Ma=2.3
(launch video showall)
GPUs in HPC
. Some HPC people are skeptical about the effective computing power of GPUs for scientific applications.
. In many works the speedup is reported against whatever CPU processors happen to be available, which is not a consistent baseline.
. The delivered speedup w.r.t. mainstream x86 processors is often much lower than expected.
. Strict data parallelism is difficult to achieve in CFD applications.
. Unfortunately, this skepticism is reinforced by the fact that GPUs come from the videogame special-effects industry, not from scientific computing.
Solution of incompressible Navier-Stokes flows on GPU
GPUs are less efficient for algorithms that require access to the card's (device) global memory. Shared memory is much faster but usually scarce (16 KB per thread block in the Tesla C1060).
The best algorithms are those that compute values for one cell using only information from that cell and its neighbors. These algorithms are classified as cellular automata (CA).
Lattice-Boltzmann and explicit F?M (FDM/FVM/FEM) methods fall in this category.
Structured meshes require less data exchange between cells (e.g. neighbor indices are computed, not stored), and so they require less shared memory. Also, very fast solvers such as FFT-based (Fast Fourier Transform) or Geometric Multigrid solvers are available.
Fractional Step Method on structured grids with QUICK
Proposed by Molemaker et al., SCA'08: 2008 ACM SIGGRAPH, "Low viscosity flow simulations for animation."
. Fractional Step Method (a.k.a. pressure segregation).
. u, v, w and continuity cells are staggered (MAC = Marker And Cell).
. The QUICK advection scheme is used in the predictor stage.
. The Poisson system is solved with IOP (Iterated Orthogonal Projection, described later), on top of Geometric Multigrid.
[Figure: staggered MAC grid, with x- and y-momentum nodes and cells ($U_{ij}$, $V_{ij}$), continuity nodes and cells ($P_{ij}$), and grid spacing h.]
Solution of the Poisson with FFT
The solution of the Poisson equation is, for large meshes, the most CPU-time consuming stage in Fractional-Step-like Navier-Stokes solvers.
We have to solve a linear system $Ax = b$.
The Discrete Fourier Transform (DFT) is an orthogonal transformation $\hat{x} = Ox = \mathrm{fft}(x)$.
The inverse transformation $O^{-1} = O^T$ is the inverse Fourier Transform $x = O^T\hat{x} = \mathrm{ifft}(\hat{x})$.
If the operator matrix $A$ is spatially invariant (i.e. the stencil is the same at all grid points) and the b.c.'s are periodic, then it can be shown that $O$ diagonalizes $A$, i.e. $OAO^{-1} = D$.
So in the transformed basis the system of equations is diagonal:

$(OAO^{-1})(Ox) = (Ob)$,
$D\hat{x} = \hat{b}$.   (1)

For $N = 2^p$ the Fast Fourier Transform (FFT) is an algorithm that computes the DFT (and its inverse) in $O(N\log(N))$ operations.
Solution of the Poisson with FFT (cont.)
So the following algorithm computes the solution of the system in $O(N\log(N))$ ops:
. $\hat{b} = \mathrm{fft}(b)$ (transform the r.h.s.),
. $\hat{x} = D^{-1}\hat{b}$ (solve the diagonal system, $O(N)$),
. $x = \mathrm{ifft}(\hat{x})$ (anti-transform to get the solution vector).
Total cost: 2 FFTs plus one element-by-element vector multiply (the reciprocals of the values of the diagonal of $D$ are precomputed).
In order to precompute the diagonal values of $D$:
. take any vector $z$ and compute $y = Az$,
. then transform $\hat{z} = \mathrm{fft}(z)$, $\hat{y} = \mathrm{fft}(y)$,
. $D_{jj} = \hat{y}_j / \hat{z}_j$.
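The recipe above can be sketched in a few lines of Python/NumPy (a minimal 1D illustration, not the actual GPU code; `apply_A` and `solve_fft` are our names): the periodic 3-point Laplacian is diagonalized by the DFT, $D$ is precomputed from a random vector, and the solve costs two FFTs plus one element-wise multiply.

```python
import numpy as np

# 1D periodic Poisson sketch: A is the 3-point Laplacian (spatially
# invariant stencil + periodic b.c.'s), so the DFT diagonalizes it.
N = 64
h = 1.0 / N

def apply_A(x):
    return (np.roll(x, -1) - 2.0 * x + np.roll(x, 1)) / h**2

# precompute D from any vector z: D_jj = fft(A z)_j / fft(z)_j
rng = np.random.default_rng(0)
z = rng.standard_normal(N)
D = np.fft.fft(apply_A(z)) / np.fft.fft(z)

def solve_fft(b):
    bh = np.fft.fft(b)                # transform the r.h.s.
    xh = np.zeros_like(bh)
    xh[1:] = bh[1:] / D[1:]           # diagonal solve; mode 0 is singular
    return np.fft.ifft(xh).real       # anti-transform

xs = np.arange(N) * h
x_exact = np.sin(2.0 * np.pi * xs)    # manufactured zero-mean solution
b = apply_A(x_exact)
err = np.max(np.abs(solve_fft(b) - x_exact))
```

The zero mode of $D$ is (numerically) zero because the periodic Laplacian is singular; skipping it fixes the additive constant of the solution.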
Solution of the Poisson equation on embedded geometries
. FFT and GMG solvers are very fast but have several restrictions: translation invariance, periodic boundary conditions. They are not well suited for embedded geometries.
. One approach for the solution is the IOP (Iterated Orthogonal Projection) algorithm.
. It is based on iteratively solving the Poisson eq. on the whole domain (fluid+solid). Solving in the whole domain is fast, because algorithms like Geometric Multigrid or FFT can be used. Also, they are very efficient running on GPUs.
. However, if we solve in the whole domain, then we can't enforce the boundary condition $\partial p/\partial n = 0$ at the solid boundary, which then means violating the impenetrability condition at the solid boundary.
The IOP (Iterated Orthogonal Projection) method
The method is based on successively solving for the incompressibility condition (on the whole domain: solid+fluid) and imposing the boundary condition.
[Figure: the two alternating projections on the whole domain (fluid+solid): one field violates the impenetrability b.c., the other satisfies it.]
The IOP (Iterated Orthogonal Projection) method (cont.)
Fixed point iteration:

$w^{k+1} = \Pi_{bdy}\,\Pi_{div}\,w^k$.

Projection onto the space of divergence-free velocity fields, $u' = \Pi_{div}(u)$:

$u' = u - \nabla P$, with $\Delta P = \nabla\cdot u$.

Projection onto the space of velocity fields that satisfy the impenetrability boundary condition, $u'' = \Pi_{bdy}(u')$:

$u'' = u_{bdy}$ in $\Omega_{bdy}$,
$u'' = u'$ in $\Omega_{fluid}$.
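The fixed-point iteration can be sketched on a 2D periodic grid. This toy version (our construction, not the actual staggered-FVM code) uses collocated fields and spectral derivatives, with the embedded body at rest, and checks that the divergence left after imposing the b.c. shrinks with the iterations.

```python
import numpy as np

# Toy IOP iteration: Pi_div is the FFT Poisson projection onto
# divergence-free fields, Pi_bdy overwrites solid cells with the body
# velocity (zero here). Collocated fields for brevity.
N = 64
h = 1.0 / N
k = 2.0 * np.pi * np.fft.fftfreq(N, d=h)
KX, KY = np.meshgrid(k, k, indexing="ij")
K2 = KX**2 + KY**2
K2[0, 0] = 1.0                      # the mean mode has zero divergence anyway

xc = (np.arange(N) + 0.5) * h
X, Y = np.meshgrid(xc, xc, indexing="ij")
solid = (X - 0.5)**2 + (Y - 0.5)**2 < 0.15**2   # embedded body mask

def div_norm(u, v):
    uh, vh = np.fft.fft2(u), np.fft.fft2(v)
    return np.linalg.norm(np.fft.ifft2(1j * KX * uh + 1j * KY * vh).real)

def project_div(u, v):
    # u' = u - grad P,  with  lap P = div u  (solved in the whole domain)
    uh, vh = np.fft.fft2(u), np.fft.fft2(v)
    Ph = -(1j * KX * uh + 1j * KY * vh) / K2
    return (np.fft.ifft2(uh - 1j * KX * Ph).real,
            np.fft.ifft2(vh - 1j * KY * Ph).real)

def project_bdy(u, v):
    u, v = u.copy(), v.copy()
    u[solid] = 0.0                  # body at rest: u_bdy = 0
    v[solid] = 0.0
    return u, v

u, v = np.ones((N, N)), np.zeros((N, N))
u, v = project_bdy(u, v)
d0 = div_norm(u, v)
for _ in range(30):                 # w^{k+1} = Pi_bdy Pi_div w^k
    u, v = project_bdy(*project_div(u, v))
d1 = div_norm(u, v)
```

After each `project_div` the field is exactly divergence-free; `project_bdy` reintroduces some divergence at the body boundary, and the alternating projections drive that residual down.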
Implementation details on the GPU
. We use the CUFFT library.
. Per iteration: 2 FFTs and a Poisson residual evaluation. The FFT on the GPU Tesla C1060 performs at 27 Gflops (in double precision), where the operations are counted as $5N\log_2(N)$.
[Figure: FFT computing rate [Gflops] vs. vector size, from 10^3 to 10^8.]
FFT computing rates in GPGPU. GTX-580
[Figure: FFT processing rate [Gflops/sec] vs. Ncell on the GTX-580, single and double precision, for meshes from 64x64x64 up to 256x256x128.]
FFTW on i7-3820@3.60GHz (Sandy Bridge)
[Figure: FFT processing rate [Mflops] vs. Ncell for nthreads = 1, 2, 4, from 10^3 to 10^9.]
NSFVM Computing rates in GPGPU. Scaling
[Figure: NSFVM computing rate [Mcell/sec] vs. number of cells [Mcell], for the GTX-580 and C2050 in SP and DP, meshes from 64x64x64 up to 256x256x256.]
NSFVM and “Real Time” computing
For a 128x128x128 mesh (≈2 Mcell), we have a computing time of 2 Mcell/(140 Mcell/sec) = 0.014 secs/time step.
That means ≈70 steps/sec.
A von Neumann stability analysis shows that the QUICK advection scheme is unconditionally unstable if advanced in time with Forward Euler.
With a second-order Adams-Bashforth scheme the critical CFL is 0.588.
For the NS eqs. the critical CFL has been found to be somewhat lower (≈0.5).
If L = 1, u = 1, h = 1/128, then $\Delta t = 0.5h/u \approx 0.004$ [sec], so that we can compute 0.28 secs of simulation time in 1 sec of wall-clock time. We say ST/RT = 0.28.
On the GTX Titan Black we expect to run ≈3.5x faster, and it has double the memory (6 GB versus 3 GB). We expect that it would allow runs with up to 12 Mcells (currently 6 Mcells in the GTX 580).
(launch video nsfvm-bodies-all)
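The budget above, restated as a tiny script (the figures are the ones quoted on this slide; variable names are ours):

```python
# Back-of-the-envelope "real time" budget.
ncell = 128**3                       # 128x128x128 mesh, ~2 Mcell
rate = 140e6                         # [cell/s], GTX-580 single precision
steps_per_sec = rate / ncell         # ~67 (the slides round to 70)
L, u, CFL = 1.0, 1.0, 0.5
h = L / 128
dt = CFL * h / u                     # simulated seconds per time step
st_rt = dt * steps_per_sec           # simulation-time / real-time ratio
```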
NSFVM and “Real Time” computing (cont.)
Computing times in GPGPU. Fractional Step components
[Figure: share of the total computing time [%] for the Fractional Step components (predictor, comp-div, CG-FFT-solver, corrector), for 128x128x128 and 256x256x128 meshes, GTX 580 (single precision).]
Current work
Current work is done in the following directions:
. Improving performance by replacing the QUICK advection scheme with MOC+BFECC (which could be more GPU-friendly).
. Implementing a CPU-based renormalization algorithm for free surface (level-set) flows.
. Another important issue is improving the representation (accuracy) of the solid body surface by using an immersed boundary technique.
Why leave QUICK?
One of the steps of the Fractional Step Method is the advection step. We have to advect the velocity field, and we want a method that is as little diffusive as possible and that allows as large a CFL number as possible.
Also, of course, we want a GPU-friendly algorithm.
Previously we used QUICK, but it has a stencil that extends more than one cell in the upwind direction. This increases shared memory usage and data transfer. We seek another low-dissipation scheme with a more compact stencil.
QUICK advection scheme
1D scalar advection, with $a$ = advection velocity and $\phi$ = the advected scalar; cell values live at half-indices, fluxes at integer faces:

$\dfrac{\partial \phi_{i+1/2}}{\partial t} = -\dfrac{(a\phi^Q)_{i+1} - (a\phi^Q)_i}{\Delta x}$,

$\phi^Q_i = \begin{cases} (3/8)\,\phi_{i+1/2} + (6/8)\,\phi_{i-1/2} - (1/8)\,\phi_{i-3/2}, & \text{if } a \ge 0,\\ (3/8)\,\phi_{i-1/2} + (6/8)\,\phi_{i+1/2} - (1/8)\,\phi_{i+3/2}, & \text{if } a \le 0. \end{cases}$

[Figure: 1D control volume cells, with cell centers at i-3/2, ..., i+5/2 and faces at i, i+1.]
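A sketch of the scheme above on a 1D periodic grid (our Python transcription; cell values at half-indices, faces at integers, $a > 0$). The test advances with second-order Adams-Bashforth at CFL 0.3, within the critical CFL 0.588 quoted on the slides:

```python
import numpy as np

# QUICK on a 1D periodic grid: phi[i] is the cell value at x_{i+1/2},
# face i lies between cells i-1 and i. Advection velocity a > 0.
N, a = 64, 1.0
h = 1.0 / N

def quick_rhs(phi):
    # face value: 6/8 upwind + 3/8 downwind - 1/8 far-upwind (a > 0)
    face = 0.375 * phi + 0.75 * np.roll(phi, 1) - 0.125 * np.roll(phi, 2)
    return -a * (np.roll(face, -1) - face) / h

CFL = 0.3
dt = CFL * h / a
x = (np.arange(N) + 0.5) * h
phi = np.sin(2.0 * np.pi * x)
rhs_old = quick_rhs(phi)
phi = phi + dt * rhs_old            # one Euler startup step for AB2
t = dt
for _ in range(213):                # ~one advection period in total
    rhs = quick_rhs(phi)
    phi = phi + dt * (1.5 * rhs - 0.5 * rhs_old)
    rhs_old = rhs
    t += dt
err = np.max(np.abs(phi - np.sin(2.0 * np.pi * (x - a * t))))
```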
Method Of Characteristics (MOC)
The Method Of Characteristics (MOC) consists in tracking the characteristic through each node back to the position it had at time $t_n$ and taking the value there:

$\phi(x_{n+1}, t_{n+1}) = \phi(x_n, t_n)$.

If $x_n$ doesn't happen to be a mesh node, this involves a projection (interpolation).
It's the basis of the Lagrangian methods for dealing with advection terms.
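A minimal 1D periodic MOC step with linear interpolation (our sketch): exact when the CFL is an integer (the foot of the characteristic lands on a node), diffusive at semi-integer CFL.

```python
import numpy as np

# Semi-Lagrangian (MOC) step on a 1D periodic grid with linear
# interpolation at the foot of the characteristic.
def moc_step(phi, a, dt, h):
    N = phi.size
    xfoot = (np.arange(N) * h - a * dt) % (N * h)   # foot position
    j = np.floor(xfoot / h).astype(int) % N         # left neighbor node
    w = xfoot / h - np.floor(xfoot / h)             # interpolation weight
    return (1.0 - w) * phi[j] + w * phi[(j + 1) % N]

N, a = 64, 1.0
h = 1.0 / N
phi0 = np.sin(2.0 * np.pi * np.arange(N) * h)

phi = phi0.copy()
for _ in range(64):                  # CFL = 1: the foot lands on a node
    phi = moc_step(phi, a, h, h)     # dt = h/a
err_int = np.max(np.abs(phi - phi0))

phi = phi0.copy()
for _ in range(128):                 # CFL = 0.5, same total time
    phi = moc_step(phi, a, 0.5 * h, h)
amp_half = np.max(np.abs(phi))
```

After one period, the integer-CFL run reproduces the initial profile exactly, while at CFL 0.5 the amplitude of this smooth profile decays to roughly 0.86, which is the behavior plotted on the next slides.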
Method Of Characteristics (MOC) (cont.)
So typically MOC has very low diffusion if the CFL is an integer number, and is too diffusive if it is a semi-integer number.
Of course, in the general case (non-uniform meshes, non-uniform velocity field) we can't manage to have an integer CFL number for all the nodes.
(launch video video-moc-cfl1), (launch video video-moc-cfl05).
[Figure: max(u) after one period (t=10, L=1, U=1, periodic b.c.) vs. CFL, for MOC and BFECC+MOC, with the initial profile u(t=0) of unit amplitude.]
MOC+BFECC
Assume we have a low-order (dissipative) advection operator $L$ (it may be SUPG, MOC, or any other): $\phi^{t+\Delta t} = L(\phi^t, u)$.
The Back and Forth Error Compensation and Correction (BFECC) method allows us to eliminate the dissipation error.
. Advance the state forward: $\phi^{t+\Delta t,*} = L(\phi^t, u)$.
. Advance the state backwards: $\phi^{t,*} = L(\phi^{t+\Delta t,*}, -u)$.
[Figure: exact solution vs. solution with dissipation error.]
MOC+BFECC (cont.)
[Figure: exact solution vs. solution with dissipation error.]
If $L$ introduces some dissipative error $\epsilon$, then $\phi^{t,*} \ne \phi^t$; in fact $\phi^{t,*} = \phi^t + 2\epsilon$.
So we can compensate for the error:

$\phi^{t+\Delta t} = \phi^{t+\Delta t,*} - \tfrac{1}{2}(\phi^{t,*} - \phi^t)$.   (2)
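Wrapping the dissipative operator with the compensation step of Eq. (2) looks like this (self-contained sketch, reusing a linear-interpolation MOC as the low-order operator $L$):

```python
import numpy as np

# BFECC wrapped around a dissipative semi-Lagrangian operator L.
def L(phi, a, dt, h):
    N = phi.size
    xfoot = (np.arange(N) * h - a * dt) % (N * h)
    j = np.floor(xfoot / h).astype(int) % N
    w = xfoot / h - np.floor(xfoot / h)
    return (1.0 - w) * phi[j] + w * phi[(j + 1) % N]

def bfecc_step(phi, a, dt, h):
    fwd = L(phi, a, dt, h)            # phi^{t+dt,*}: advance forward
    back = L(fwd, -a, dt, h)          # phi^{t,*}: advance backwards
    return fwd - 0.5 * (back - phi)   # compensate the dissipation error

N, a = 64, 1.0
h = 1.0 / N
phi0 = np.sin(2.0 * np.pi * np.arange(N) * h)
dt = 0.5 * h / a                      # CFL = 0.5, worst case for plain MOC

phi_moc, phi_bfecc = phi0.copy(), phi0.copy()
for _ in range(128):                  # one full advection period
    phi_moc = L(phi_moc, a, dt, h)
    phi_bfecc = bfecc_step(phi_bfecc, a, dt, h)
amp_moc = np.max(np.abs(phi_moc))
amp_bfecc = np.max(np.abs(phi_bfecc))
```

At this semi-integer CFL the plain MOC amplitude decays noticeably over one period, while the BFECC-corrected run stays essentially at unit amplitude.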
MOC+BFECC (cont.)
(launch video video-moc-bfecc-cfl05).
[Figure: max(u) after one period (t=10, L=1, U=1, periodic b.c.) vs. CFL, for MOC and BFECC+MOC, with the initial profile u(t=0).]
MOC+BFECC (cont.)
Nbr of cells    QUICK-SP  BFECC-SP  QUICK-DP  BFECC-DP
64x64x64        29.09     12.38     15.9      5.23
128x128x128     75.74     18.00     28.6      7.29
192x192x192     78.32     17.81     30.3      7.52
Cubic cavity. Computing rates for the whole NS solver (one step) in [Mcell/sec] obtained with the BFECC and QUICK algorithms on an NVIDIA GTX 580. 3 Poisson iterations were used.
Analysis of performance
Regarding the performance results shown above, it can be seen that the computing rate of QUICK is at most ≈4x that of BFECC. So BFECC is more efficient than QUICK whenever it is used with CFL ≥ 2, the critical CFL for QUICK being ≈0.5. The CFL used in our simulations is typically ≈5 and thus, at this CFL, the BFECC version runs 10/4 = 2.5 times faster than the QUICK version.
The speedup of MOC+BFECC versus QUICK increases with the number of Poisson iterations. In the limit of a very large number of iterations (very low tolerance for the Poisson solver) we expect a speedup of ≈10x (equal to the CFL ratio).
Validation. Lid driven 3D cubic cavity
Re=1000, mesh of 128x128x128 (≈2 Mcell). Results are compared with Ku et al. (JCP 70:439-462 (1987)).
More validation and a complete performance study in Costarelli et al., Cluster Computing (2013), DOI:10.1007/s10586-013-0329-9.
[Figure: centerline velocity profiles w(y) and v(z) for the lid-driven cubic cavity, Ku et al. vs. the present solver (NSFDM).]
Renormalization
Even with a high-precision, low-dissipation algorithm for transporting the level set function, we have to renormalize $\phi \to \phi'$ with a certain frequency.
Requirements on the renormalization algorithm are:
. $\phi'$ must preserve as much as possible the 0-level set (the interface) of $\phi$.
. $\phi'$ must be as regular as possible near the interface.
. $\phi'$ must have a high slope near the interface.
. Usually the signed distance function is used, i.e.

$\phi'(x) = \mathrm{sign}(\phi(x))\,\min_{y\in\Gamma}\|y - x\|$.

[Figure: a level-set function $\phi$ and its renormalized version $\phi'$ along x, with the interface $\Gamma$ preserved.]
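A brute-force version of the signed-distance definition above (our sketch): sample $\Gamma$, take the minimum distance to the samples, and attach the sign of $\phi$. This is exactly the $O(N\,N_\Gamma)$ cost that motivates the Fast Marching algorithm on the next slides.

```python
import numpy as np

# Brute-force renormalization: signed minimum distance to sampled Gamma.
N, R = 100, 0.2
xc = (np.arange(N) + 0.5) / N
X, Y = np.meshgrid(xc, xc, indexing="ij")
phi = (X - 0.5)**2 + (Y - 0.5)**2 - R**2      # any phi with this zero set

th = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
gx, gy = 0.5 + R * np.cos(th), 0.5 + R * np.sin(th)   # points on Gamma

d = np.full((N, N), np.inf)
for px, py in zip(gx, gy):                    # O(N^2 * NGamma) sweep
    d = np.minimum(d, np.hypot(X - px, Y - py))
phi_renorm = np.sign(phi) * d                 # renormalized level set

exact = np.hypot(X - 0.5, Y - 0.5) - R        # analytic signed distance
err = np.max(np.abs(phi_renorm - exact))
```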
Renormalization (cont.)
. Computing the distance function naively is $O(N\,N_\Gamma)$, where $N_\Gamma$ is the number of points on the interface. This scales typically $\propto N^{1+(n_d-1)/n_d}$ ($N^{5/3}$ in 3D).
. Many variants are based on solving the Eikonal equation

$|\nabla\phi| = 1$.

. As it is a hyperbolic equation, it can be solved by a marching technique. The algorithm traverses the domain with an advancing front starting from the level set.
. However, it can develop caustics (shocks) and rarefaction waves, so an entropy condition must be enforced.
[Figure: advancing front developing a caustic and an expansion fan.]
Renormalization (cont.)
The Fast Marching algorithm proposed by Sethian (Proc Nat Acad Sci 93(4):1591-1595 (1996)) is a fast (near-optimal) algorithm based on Dijkstra's algorithm for computing minimum distances in graphs from a source set. (Note: the original Dijkstra algorithm is $O(N^2)$, not fast. The fast version using a priority queue is due to Fredman and Tarjan (J. ACM 34(3):596-615, 1987), and the complexity is $O(N\log|Q|) \approx O(N\log N)$.)
[Figure: grid with the level set, the advancing front Q, and the far-away set F.]
The Fast Marching algorithm
We explain the algorithm for the positive part $\phi' \ge 0$; it is then reversed for $\phi' \le 0$.
All nodes are in one of: Q = advancing front, F = far-away, I = frozen/inactive. The advancing front sweeps the domain starting at the level set and converts F nodes to I.
. Initially Q = {nodes that are in contact with the level set}. Their distance to the interface is computed for each cut-cell. The rest are in F = far-away.
. loop: Take the node X in Q closest to the interface. Move it from Q to I.
. Update all the distances of the neighbors of X and move them from F to Q.
. Go to loop.
The algorithm ends when Q = ∅.
[Figure: grid with the level set, the advancing front Q, the far-away set F, and the frozen set I.]
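The Q/F/I sweep above can be sketched in Python with a first-order upwind update of the Eikonal equation (our transcription; the "interface" here is a single point whose 3h-neighborhood is initialized exactly, rather than a cut-cell interface):

```python
import heapq
import numpy as np

# Fast Marching sweep: F (far away), Q (advancing front, in a heap),
# I (frozen). First-order upwind Eikonal update.
N = 64
h = 1.0 / N
INF = float("inf")
d = np.full((N, N), INF)
frozen = np.zeros((N, N), dtype=bool)          # the I set
si, sj = N // 2, N // 2                        # source node

for i in range(N):
    for j in range(N):
        r = h * np.hypot(i - si, j - sj)
        if r <= 3.0 * h:                       # exact initialization
            d[i, j] = r
            frozen[i, j] = True

def update(i, j):
    # first-order upwind solve of |grad d| = 1 from the known neighbors
    a = min(d[i - 1, j] if i > 0 else INF, d[i + 1, j] if i < N - 1 else INF)
    b = min(d[i, j - 1] if j > 0 else INF, d[i, j + 1] if j < N - 1 else INF)
    a, b = min(a, b), max(a, b)
    if b == INF or b - a >= h:
        return a + h
    return 0.5 * (a + b + np.sqrt(2.0 * h * h - (a - b)**2))

nbrs = ((1, 0), (-1, 0), (0, 1), (0, -1))
Q = []                                         # heap of (distance, i, j)
for i in range(N):
    for j in range(N):
        if not frozen[i, j] and any(0 <= i + di < N and 0 <= j + dj < N
                                    and frozen[i + di, j + dj]
                                    for di, dj in nbrs):
            d[i, j] = update(i, j)
            heapq.heappush(Q, (d[i, j], i, j))

while Q:
    dist, i, j = heapq.heappop(Q)              # closest node of the front
    if frozen[i, j] or dist > d[i, j]:
        continue                               # stale heap entry
    frozen[i, j] = True                        # Q -> I
    for di, dj in nbrs:
        ii, jj = i + di, j + dj
        if 0 <= ii < N and 0 <= jj < N and not frozen[ii, jj]:
            dn = update(ii, jj)
            if dn < d[ii, jj]:
                d[ii, jj] = dn                 # F -> Q, or decrease-key
                heapq.heappush(Q, (dn, ii, jj))

II, JJ = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
err = np.max(np.abs(d - h * np.hypot(II - si, JJ - sj)))
```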
FastMarch: error and regularity of the distance function
A numerical example shows the regularity of the computed distance function on a mesh of 100x100.
We have a level set consisting of a circle of R = 0.2 inside a square of side L = 1.
$\phi$ is shown along the x = 0.6 cut of the geometry; we also show the first and second derivatives.
$\phi$ deviates less than $10^{-3}$ from the analytical distance.
Small spikes are observed in the second derivative.
The error $|\phi - \phi_{ex}|$ shows the discontinuity in the slope at the level set.
[Figure: $\phi$, $d\phi/dx$, $d^2\phi/dx^2$ and $|\phi - \phi_{ex}|$ along the x = 0.6 cut, for the circle R = 0.2 in the L = 1 square.]
FastMarch: implementation details
The complexity is O(N) times the cost of finding the node in Q closest to the level set.
This can be implemented very efficiently with a priority queue implemented on top of a heap. In this way finding the closest node is $O(\log|Q|)$, so the total cost is

$O(N\log|Q|) \approx O(N\log(N^{(n_d-1)/n_d})) = O(N\log N^{2/3})$ (in 3D).

The standard C++ class priority_queue is not appropriate because it doesn't give access to the elements in the queue.
We implemented the heap structure on top of a vector and an unordered_map (hash-table based) that tracks the Q-nodes in the structure. The hash function used is very simple.
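The data-structure idea can be sketched as follows. Python's heapq, like std::priority_queue, has no decrease-key either; this sketch gets the same $O(\log|Q|)$ behavior with a hash map of current keys plus lazy deletion of stale heap entries (the actual implementation instead updates the heap in place through the position map):

```python
import heapq

# A priority queue with decrease-key via a hash map + lazy deletion.
class FrontQueue:
    def __init__(self):
        self._heap = []          # (key, node) pairs, possibly stale
        self._best = {}          # node -> current key (the hash-map role)

    def push_or_decrease(self, node, key):
        if key < self._best.get(node, float("inf")):
            self._best[node] = key
            heapq.heappush(self._heap, (key, node))

    def pop_min(self):
        while self._heap:
            key, node = heapq.heappop(self._heap)
            if self._best.get(node) == key:   # ignore stale entries
                del self._best[node]
                return node, key
        raise IndexError("empty queue")

    def __len__(self):
        return len(self._best)

q = FrontQueue()
q.push_or_decrease("a", 3.0)
q.push_or_decrease("b", 1.5)
q.push_or_decrease("a", 0.7)      # decrease-key for "a"
first = q.pop_min()               # ("a", 0.7)
second = q.pop_min()              # ("b", 1.5)
```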
FastMarch renorm: Efficiency
. The Fast Marching algorithm is $O(N\log|Q|)$, where N is the number of cells and |Q| the size of the advancing front.
. The rates were evaluated on an Intel i7-950@3.07GHz (Nehalem).
. The computing rate is practically constant and even decreases for high N.
. Since the rate for the NS-FVM algorithm is ≈100 [Mcell/s], renormalizing at a frequency greater than 1/200 steps would be too expensive.
. The cost of the renormalization step is reduced with band renormalization and parallelism (SMP).
[Figure: Fast Marching processing rate [Mcells/s] vs. the number of cells, for meshes from 32x32 up to 2048x2048.]
FastMarch renorm: band renormalization
. The renormalization algorithm doesn't need to cover the whole domain. Only a band around the level set (interface) is needed.
. The algorithm is modified simply: set the distance in far-away nodes to $d = d_{max}$.
. The cost is proportional to the volume of the band, i.e. $V_{band} = S_{band}\cdot 2d_{max} \propto d_{max}$.
. A low $d_{max}$ reduces the cost, but increases the probability of forcing a new renormalization, thus increasing the renormalization frequency.
[Figure: $\phi$ renormalized with and without $d_{max}$; the renormalization band has width $2d_{max}$.]
FastMarch renorm: Parallelization
How to parallelize FastMarch? We can use speculative parallelism: while processing a node X at the top of the heap, we can process in parallel the following node Y, speculating that most of the time Y will be far from X and can then be processed independently. This can be checked afterwards, using time-stamps for instance.
[Figure: advancing front around the level set, with nodes numbered by processing order; all nodes of the same color can be processed at the same time.]
FastMarch renorm: Parallelization (cont.)
How many nodes can be processed concurrently? It turns out that the simultaneity (the number of nodes that can be processed simultaneously) grows linearly with refinement.
Average simultaneity:
. 16x16: 11.358
. 32x32: 20.507
Percentage of times the simultaneity is ≥ 4:
. 16x16: 93.0%
. 32x32: 98.0%
[Figure: number of independent points vs. position along the advancing front (in % of the front nodes), for both mesh sizes.]
FastMarching: computational budget
With band renormalization and SMP parallelization we expect a rate of ≈20 Mcell/s.
That means that a 128^3 mesh (≈2 Mcell) can be renormalized in ≈100 ms.
This is ≈7x the time required for one time step (14 ms).
Renormalization is amortized if it is performed every ≈20 time steps or more.
Transfer of the data to and from the processor through the PCI Express 2.0 x16 channel (4 GB/s transfer rate) is on the order of 10 ms.
BTW: note that transfers to/from the card are amortized only if they are performed once every ≈10 steps or so. Such transfers can't be done every time step.
Fluid structure interaction
In the previous examples we have shown several cases of rigid solid bodies moving immersed in a fluid. In those cases the position of the body is prescribed beforehand.
A more complex case is when the position of the body results from the interaction with the fluid.
In general, if the body is not rigid, we have to solve the elasticity equations in the body domain $\Omega_{bdy}(t)$. We assume here that the body is rigid, so that we just solve for the 6 d.o.f. of the rigid body with the balance of linear and angular momentum.
Fluid structure in the real world
[Figures: Three Gorges dam (PR China); Francis turbine; Belo Monte Francis turbine installation; Sayano-Shushenskaya dam (Russia), machine room before and after the 2009 accident.]
IMPSA (Industria Metalúrgica Pescarmona) is manufacturing three Francis turbines for the Belo Monte (Brazil) dam. Once in operation it will be the 3rd largest dam in the world.
Ignition of a rocket engine (launch video video-simulacion-motor-cohete).
Fluid structure in the real world (cont.)
CIMEC computed the rotordynamic coefficients, that is, a reduced-parameter representation of the fluid in the gap between stator and rotor. (launch video movecyl-tadpoles)
The rotordynamic coefficients represent the forces of the fluid on the rotor as the rotor orbits around its centered position.
They are used as a component in the design of the structural part of the turbine.
They represent the added mass, damping, and stiffness (a.k.a. the Lomakin effect).
The geometry is very complex: a circular annulus of diameter 8 m, gap 3 mm, and height 2 m.
A buoyant fully submerged sphere
The fluid is water, $\rho_{fluid} = 1000\ [kg/m^3]$.
The body is a smooth sphere, D = 10 cm, fully submerged in a box of L = 39 cm. The solid density is $\rho_{bdy} = 366\ [kg/m^3] \approx 0.4\,\rho_{fluid}$, so the body has positive buoyancy. The buoy (sphere) is tied by a string to an anchor point at the center of the bottom side of the box.
The box is subject to horizontal harmonic oscillations:

$x_{box} = A_{box}\sin(\omega t)$.
Spring-mass-damper oscillator
The acceleration of the box creates a driving force on the sphere. Note
that as the sphere is positively buoyant the driving force is in phase with
the displacement. That means that if the box accelerates to the right, then
the sphere tends to move in the same direction. (launch video p2-f09)
The buoyancy of the sphere produces a restoring force that tends to keep
the string aligned with the vertical direction.
The mass of the sphere plus the added mass of the liquid gives inertia.
The drag of the fluid gives a damping effect.
So the system can be approximated as a (nonlinear) spring-mass-damper
oscillator
    m_tot ẍ  +  0.5 ρ C_D A |ẋ| ẋ  +  (m̃_tot g/L) x  =  m̃_tot a_box
    (inertia)      (damping)            (spring)        (driving force)
where
    m_tot = m_bdy + m_add,   m̃_tot = m_d − m_bdy,
    m_add = C_madd m_d,   C_madd ≈ 0.5,   m_d = ρ V_bdy.
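The model above can be integrated numerically. The sketch below (Python, illustrative) uses the parameter values quoted on these slides; the string length `L_str` is not stated here, so the value used is an assumed placeholder, and the RK4 integrator, time step, and forcing frequency are illustrative choices, not the scheme used in the talk.

```python
import math

# Parameters from the slides; L_str (string length) is an ASSUMED placeholder.
rho, g = 1000.0, 9.81          # water density [kg/m^3], gravity [m/s^2]
D = 0.10                       # sphere diameter [m]
V_bdy = math.pi*D**3/6         # sphere volume
A_frt = math.pi*D**2/4         # frontal area used in the drag term
m_bdy = 0.192                  # body mass [kg] (slide value)
m_d = rho*V_bdy                # displaced mass, ~0.523 kg (matches slide)
C_madd, C_D = 0.55, 0.3        # best-fit coefficients from the slides
m_add = C_madd*m_d             # added mass, ~0.288 kg (matches slide)
m_tot = m_bdy + m_add          # total inertia
m_til = m_d - m_bdy            # buoyancy mass m~_tot: drives spring and forcing
L_str = 0.15                   # string length [m] -- assumed, not on the slides
k = m_til*g/L_str              # pendulum-like restoring stiffness
A_box, f = 0.02, 0.9           # box amplitude [m] and frequency [Hz]
w = 2*math.pi*f

def accel(t, x, v):
    """m_tot x'' + 0.5 rho C_D A |x'| x' + k x = m_til * a_box(t)."""
    a_box = -A_box*w*w*math.sin(w*t)
    drag = 0.5*rho*C_D*A_frt*abs(v)*v
    return (m_til*a_box - drag - k*x)/m_tot

# RK4 until transients die out, then record the steady-state amplitude.
dt, x, v, t, amp = 1e-3, 0.0, 0.0, 0.0, 0.0
while t < 60.0:
    k1x, k1v = v, accel(t, x, v)
    k2x, k2v = v + 0.5*dt*k1v, accel(t + 0.5*dt, x + 0.5*dt*k1x, v + 0.5*dt*k1v)
    k3x, k3v = v + 0.5*dt*k2v, accel(t + 0.5*dt, x + 0.5*dt*k2x, v + 0.5*dt*k2v)
    k4x, k4v = v + dt*k3v, accel(t + dt, x + dt*k3x, v + dt*k3v)
    x += dt*(k1x + 2*k2x + 2*k3x + k4x)/6
    v += dt*(k1v + 2*k2v + 2*k3v + k4v)/6
    t += dt
    if t > 40.0:
        amp = max(amp, abs(x))
print(f"m_d = {m_d:.3f} kg, m_add = {m_add:.3f} kg, amplitude = {amp*1000:.1f} mm")
```

Note that m_d and m_add computed from D and the densities reproduce the values quoted later in the talk (0.523 kg and 0.288 kg), which is a useful consistency check on the model parameters.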
Experimental results
Experiments carried out at Univ Santiago de Chile (Dr. Marcela Cruchaga).
Plexiglas box mounted on a shake table, subject to oscillations of 2 cm
amplitude at frequencies 0.5–1.5 Hz.
The sphere position is obtained using Motion Capture (MoCap) from a
high-speed (HS) camera (800x800 pixels @ 120 fps). The HS cam is an AOS
Q-PRI; it can do 3 Mpixel @ 500 fps, and its maximum frame rate is
100,000 fps.
MoCap is performed with a home-made code. It finds the current position by
minimization of a functional, and can also detect rotations. In this
experimental setup 1 pixel = 0.5 mm (D = 200 px). Camera distortion and
refraction have been evaluated to be negligible. The MoCap resolution can
be less than one pixel. (launch video mocap2)
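The MoCap idea, finding the body's position by minimizing a functional, can be illustrated with a toy example. The sketch below is not the home-made code from the talk: it uses a synthetic image, a plain sum-of-squared-differences functional, and integer shifts only (the real code handles sub-pixel resolution and rotations).

```python
# Toy MoCap sketch: locate a displaced disk by minimizing an SSD functional.
W = 40  # image size in pixels (synthetic)

def disk_image(cx, cy, r=6):
    """Synthetic grayscale frame: bright disk on dark background."""
    return [[1.0 if (i-cx)**2 + (j-cy)**2 <= r*r else 0.0
             for j in range(W)] for i in range(W)]

template = disk_image(20, 20)          # reference frame, body centered
frame = disk_image(23, 18)             # new frame, body displaced by (3, -2)

def ssd(shift):
    """Functional: SSD between the template shifted by `shift` and the frame."""
    di, dj = shift
    s = 0.0
    for i in range(W):
        for j in range(W):
            ii, jj = i - di, j - dj
            t = template[ii][jj] if 0 <= ii < W and 0 <= jj < W else 0.0
            s += (frame[i][j] - t)**2
    return s

# Minimize over a search window of integer displacements.
best = min(((di, dj) for di in range(-5, 6) for dj in range(-5, 6)), key=ssd)
print("estimated displacement:", best)   # (3, -2)
```

The same minimization viewpoint extends naturally to sub-pixel accuracy (interpolate the image and minimize over continuous shifts) and to rotations (add an angle to the parameter vector), which is what makes resolutions below one pixel possible.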
Experimental results (cont.)
[Plots: phase [deg] and amplitude [m] vs. frequency [Hz]; curves: anly(CD=0.3, Cmadd=0.55) and exp.]
The best fit of the experimental results with the simple spring-mass-damper
oscillator is obtained for C_madd = 0.55 and C_D = 0.3. Reference values
are:
C_madd = 0.5 (potential flow), C_madd = 0.58 (Neill et al., Amer. J. Phys.,
2013),
C_D = 0.47 (reference value). (launch video description-buoy-resoviz)
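One plausible way to produce analytic response curves like anly(CD, Cmadd) is equivalent (harmonic) linearization of the quadratic drag; the slides do not say this is how their curves were obtained, so the sketch below is an assumption. The string length `L_str` is also an assumed placeholder, since it is not given on the slides.

```python
import math

rho, g = 1000.0, 9.81
D = 0.10
A_frt = math.pi*D**2/4             # frontal area
m_bdy, m_d = 0.192, rho*math.pi*D**3/6
C_madd, C_D = 0.55, 0.3            # best-fit values quoted on the slide
m_add = C_madd*m_d
m_tot, m_til = m_bdy + m_add, m_d - m_bdy
L_str = 0.15                       # string length [m] -- assumed, not given
k = m_til*g/L_str                  # restoring stiffness
A_box = 0.02                       # box amplitude [m]

def response(f):
    """Steady amplitude [m] and phase lag [deg] at frequency f [Hz]."""
    w = 2*math.pi*f
    X = A_box                      # initial guess for the amplitude
    for _ in range(200):           # damped fixed-point iteration on X
        # Equivalent linear damping for drag 0.5 rho C_D A |v| v
        # under harmonic motion of amplitude X (harmonic linearization).
        c_lin = (8/(3*math.pi))*0.5*rho*C_D*A_frt*w*X
        X_new = m_til*w*w*A_box/math.hypot(k - m_tot*w*w, c_lin*w)
        X = 0.5*(X + X_new)
    phase = math.degrees(math.atan2(c_lin*w, k - m_tot*w*w))
    return X, phase

for f in (0.6, 0.9, 1.2):
    X, ph = response(f)
    print(f"f = {f:.1f} Hz: amplitude = {X*1000:.1f} mm, phase = {ph:.1f} deg")
```

As expected for a driven oscillator, the phase lag crosses 90 degrees at resonance, which is the qualitative shape of the experimental phase curve.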
Computational FSI model
The fluid-structure coupling is computed with a partitioned Fluid
Structure Interaction (FSI) technique.
The fluid is solved in the step [t_n, t_{n+1} = t_n + Δt] between the known
position x_n of the body at t_n and an extrapolated (predicted) position
x_{n+1,P} at t_{n+1}.
Then the fluid forces (pressure and viscous tractions) are computed and the
rigid-body equations are solved to obtain the true position of the solid,
x_{n+1}.
The scheme is second-order accurate. It can be cast as a fixed-point
iteration and iterated to convergence (enhancing stability).
The rigid body eqs. are now:
    m_bdy ẍ + (m̃_tot g/L) x = F_fluid + m̃_tot a_box
where F_fluid = F_drag + F_add is now computed with the CFD-NS-FVM code.
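The staggered scheme described above can be sketched as a fixed-point loop. In this toy Python sketch the fluid solve is replaced by a stub force law, since the real step runs the NS-FVM GPU code; the body integrator, stiffness values, and the stub itself are illustrative assumptions, not the talk's implementation.

```python
# Toy sketch of one partitioned FSI time step with fixed-point iteration.
# The "fluid solver" is a stand-in stub; in the real scheme it is the
# NS-FVM GPU solve between x_n and the predicted position x_{n+1,P}.
m_bdy, k_str, dt = 0.192, 21.7, 1e-2   # body mass, spring stiffness (illustrative)

def fluid_force(x_pred, v_n):
    """Stub for F_fluid = F_drag + F_add (toy linear model, NOT the CFD code)."""
    return -2.0*x_pred - 0.5*v_n

def body_step(x_n, v_n, F):
    """One implicit-midpoint step of m_bdy x'' + k_str x = F (linear, exact)."""
    a = (F - k_str*x_n - 0.5*k_str*dt*v_n)/(m_bdy + 0.25*dt*dt*k_str)
    v = v_n + dt*a
    return x_n + 0.5*dt*(v_n + v), v

def fsi_step(x_n, v_n, tol=1e-12, maxit=50):
    x_p = x_n + dt*v_n                  # predicted (extrapolated) position
    for _ in range(maxit):
        F = fluid_force(x_p, v_n)       # "fluid" solved against the prediction
        x_new, v_new = body_step(x_n, v_n, F)
        if abs(x_new - x_p) < tol:      # fixed point reached
            break
        x_p = x_new                     # iterate to convergence (stability)
    return x_new, v_new

x, v = 0.05, 0.0                        # initial displacement [m] and velocity
for _ in range(100):
    x, v = fsi_step(x, v)
```

Running the predictor once per step gives the plain staggered scheme; iterating the loop to convergence recovers the stabilized variant mentioned on the slide.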
Why use a positively buoyant body?
Governing equations for the body:
    m_bdy ẍ + (m̃_tot g/L) x = F_fluid({x}) + m̃_tot a_box
where F_fluid = F_drag + F_add are the fluid forces computed with the
CFD-NS-FVM code.
The whole methodology, computing the response curves (amplitude vs.
frequency) and comparing them with experiments, is a means to validate the
precision of the numerical model.
The most important and hardest task from the point of view of the numerical
method is to compute the fluid forces (drag and added mass). Buoyancy is
almost trivial and can easily be well captured or factored out from the
simulation.
Then, if the body is too heavy it is barely affected by the fluid forces,
so we prefer a body as light as possible.
In the present example m_bdy = 0.192 kg, m_d = 0.523 kg, m_add = 0.288 kg.
Numerical results
Results with NS-FVM-GPGPU on a GTX-580 with 3 GB RAM. 6 Mcells, at a rate
of 140 Mcell/s. (launch video fsi-boya-gpu-vorticity), (launch video
gpu-corte), (launch video particles-buoyp5)
PETSc-FEM results are also included as a reference: boundary-fitted mesh,
400 k elements. (launch video fem-mesh-quality)
[Figures: vorticity isosurface; velocity at the symmetry plane; tadpoles;
FEM mesh distortion.]
Comparison of experimental and numerical results
[Plots: amplitude [m] and phase [deg] vs. frequency [Hz] (0.5–1.3 Hz); curves: exp, anal(CD=0.3, Cmadd=0.55), NSFVM/GPU, FEM.]
Conclusions
The NS-FVM implementation reaches high computing rates in GPGPU
hardware (O(140 Mcell/s)).
It can represent complex moving bodies without meshing.
Surface representation of bodies can be made second order (not
implemented yet).
Solution of the Poisson problem is currently a significant part of the
computing time. This is reduced by combining the AGP preconditioner with
MOC-BFECC.
MOC+BFECC has lower computing rates than QUICK (4x slower) but may reach
CFL = 5 (versus CFL = 0.5 for QUICK), so overall we get a speedup of 2.5x.
Speedups may be higher if lower tolerances are required for the Poisson
stage (more Poisson iterations).
Fluid-Structure interaction is solved with a partitioned strategy and is
validated against experimental and numerical (FEM) results.
Acknowledgments
This work has received financial support from
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET,
Argentina, PIP 5271/05),
Universidad Nacional del Litoral (UNL, Argentina, grant CAI+D
2009-65/334),
Agencia Nacional de Promoción Científica y Tecnológica (ANPCyT,
Argentina, grants PICT-1506/2006, PICT-1141/2007, PICT-0270/2008), and
European Research Council (ERC) Advanced Grant, Real Time
Computational Mechanics Techniques for Multi-Fluid Problems
(REALTIME, Reference: ERC-2009-AdG, Dir: Dr. Sergio Idelsohn).
The authors made extensive use of Free Software such as the GNU/Linux OS,
the GCC/G++ compilers, and Octave, and of Open Source software such as VTK,
among many others. In addition, many ideas from these packages have been an
inspiration to them.