Advances in NS solver on GPU, by M. Storti et al.
Advances in the Solution of NS Eqs. in GPGPU Hardware
by Mario Storti (a,b), Santiago Costarelli (a,b), Luciano Garelli (a), Marcela Cruchaga (c), Ronald Ausensi (c), S. Idelsohn (a,b,d,e), Gilles Perrot (f), Raphaël Couturier (f)
(a) Centro de Investigación de Métodos Computacionales, CIMEC (CONICET-UNL), Santa Fe, Argentina, mario.storti@gmail.com, www.cimec.org.ar/mstorti
(b) Facultad de Ingeniería y Ciencias Hídricas. UN Litoral. Santa Fe, Argentina, fich.unl.edu.ar
(c) Depto Ing. Mecánica, Univ Santiago de Chile
(d) International Center for Numerical Methods in Engineering (CIMNE), Technical University of Catalonia (UPC), www.cimne.com
(e) Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain, www.icrea.cat
(f) FEMTO-ST Institute, Franche-Comté Meso-Centre
CIMEC-CONICET-UNL 1
((version texstuff-1.2.8-4-g7cfd85f Mon Apr 27 23:01:25 2015 -0300) (date Thu Apr 30 12:04:10 2015 -0300))
Scientific computing on GPUs
• Graphics Processing Units (GPUs) are specialized hardware designed to offload computation from the CPU in graphics-intensive applications.
• They have many cores (thread processors); currently the Tesla K40 (Kepler GK110) has 2880 cores at 745 MHz (boost 875 MHz).
• The raw computing power is on the order of teraflops (4.3 Tflops in SP and 1.43 Tflops in DP).
• Memory bandwidth (GDDR5): 288 GB/s. Memory size: 12 GiB.
• Cost: USD 3,100. Low-end version GeForce GTX Titan: USD 1,500.
Scientific computing on GPUs (cont.)
• The difference between the GPU architecture and standard multicore processors is that GPUs have many more computing units (ALUs, Arithmetic-Logic Units, and SFUs, Special Function Units), but few control units.
• The programming model is SIMD (Single Instruction Multiple Data).
• Branch divergence happens when a branch condition evaluates to true in some threads and false in others. Due to the SIMD model, the threads that take different branches can't be executed simultaneously.
• GPUs compete with many-core processors (e.g. Intel's Xeon Phi, Knights Corner, 60 cores).
if (COND) {
BODY-TRUE;
} else {
BODY-FALSE;
}
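A minimal sketch of why divergence costs time, assuming a simplified lockstep model in which a group of threads that disagrees on COND must run BODY-TRUE and BODY-FALSE one after the other. The function name `warp_cost` and the cycle counts are illustrative assumptions, not measurements.

```python
def warp_cost(conds, cost_true, cost_false):
    """Cycles spent by one lockstep thread group on an if/else.

    conds: one boolean per thread (the value of COND in that thread).
    A branch body is paid for once if at least one thread takes it.
    """
    cost = 0
    if any(conds):        # some thread needs BODY-TRUE
        cost += cost_true
    if not all(conds):    # some thread needs BODY-FALSE
        cost += cost_false
    return cost


uniform = warp_cost([True] * 32, cost_true=10, cost_false=7)
divergent = warp_cost([True] * 16 + [False] * 16, cost_true=10, cost_false=7)
print(uniform, divergent)
```

In this model a uniform group pays for one body only, while a divergent group pays for both, which is why the later slides favor formulations without branches.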
GPUs, MIC, and multi-core processors
• Computing power is limited by the number of transistors we can put on a chip. This is limited in turn by the largest silicon chip you can grow ($O(400\ \mathrm{mm}^2)$) and the minimum size of the transistors you can make ($O(22\ \mathrm{nm})$). So the practical limit is on the order of $5 \times 10^9$ transistors per chip.
• With that number of transistors you can do one of the following:
  - Have a few (6 to 12) very powerful cores: Xeon E5-26XX processors.
  - Have 60 or more less powerful cores, each one with 40 million transistors, similar to a Pentium processor: Intel Many Integrated Core (MIC), Xeon Phi.
  - Have a very large number (O(3000)) of simple cores (without control): Nvidia GPUs.
[Figure: die shot of the Nvidia "Kepler1" GK104 GPU]
Solution of incompressible Navier-Stokes flows on GPU
• GPUs are less efficient for algorithms that require access to the card's (device) global memory. In older cards shared memory is much faster but usually scarce (16K per thread block in the Tesla C1060).
• The best algorithms are those that make computations for one cell requiring only information on that cell and its neighbors. These algorithms are classified as cellular automata (CA).
• In newer cards there is a much larger memory cache. Even so, data locality is fundamental.
• Lattice-Boltzmann and explicit F*M (FDM/FVM/FEM) fall in this category.
• Structured meshes require less data to exchange between cells (e.g. neighbor indices are computed, not stored), and so they require less shared memory. Also, very fast solvers like FFT-based (Fast Fourier Transform) or Geometric Multigrid are available.
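To illustrate "neighbor indices are computed, not stored", here is a small sketch for a structured 3D grid. The x-fastest linear ordering, the periodic wrap-around and the function names are assumptions made for the example, not details taken from the solver.

```python
def cell_index(i, j, k, nx, ny):
    """Linear index of cell (i, j, k), with i running fastest."""
    return i + nx * (j + ny * k)


def neighbors(idx, nx, ny, nz):
    """Indices of the six face neighbors, computed on the fly.

    No connectivity table is stored; periodic wrap-around is assumed.
    """
    i = idx % nx
    j = (idx // nx) % ny
    k = idx // (nx * ny)
    offsets = ((-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1))
    return [
        cell_index((i + di) % nx, (j + dj) % ny, (k + dk) % nz, nx, ny)
        for di, dj, dk in offsets
    ]


print(neighbors(0, 4, 4, 4))
```

An unstructured mesh would instead need a stored adjacency list per cell, which is extra data to move through the memory hierarchy.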
Fractional Step Method on structured grids with QUICK
Proposed by Molemaker et al., SCA'08: 2008 ACM SIGGRAPH, Low viscosity flow simulations for animation.
• Fractional Step Method (a.k.a. pressure segregation).
• u, v, w and continuity cells are staggered (MAC = Marker And Cell).
• QUICK advection scheme is used in the predictor stage.
• Poisson system is solved with IOP (Iterated Orthogonal Projection) (to be described later), on top of Geometric MultiGrid.
[Figure: staggered (MAC) grid in the x-y plane showing the x-momentum nodes $U_{ij}$, y-momentum nodes $V_{ij}$ and continuity nodes $P_{ij}$, together with the x-momentum, y-momentum and continuity cells.]
Solution of the Poisson with FFT
• Solution of the Poisson equation is, for large meshes, the most CPU time consuming stage in Fractional-Step-like Navier-Stokes solvers.
• We have to solve a linear system $Ax = b$.
• The Discrete Fourier Transform (DFT) is an orthogonal transformation $\hat{x} = Ox = \mathrm{fft}(x)$.
• The inverse transformation $O^{-1} = O^T$ is the inverse Fourier Transform $x = O^T \hat{x} = \mathrm{ifft}(\hat{x})$.
• If the operator matrix $A$ is spatially invariant (i.e. the stencil is the same at all grid points) and the b.c.'s are periodic, then it can be shown that $O$ diagonalizes $A$, i.e. $OAO^{-1} = D$.
• So in the transformed basis the system of equations is diagonal:
  $(OAO^{-1})(Ox) = (Ob)$, i.e. $D\hat{x} = \hat{b}$. (1)
• For $N = 2^p$ the Fast Fourier Transform (FFT) is an algorithm that computes the DFT (and its inverse) in $O(N \log N)$ operations.
Solution of the Poisson with FFT (cont.)
• So the following algorithm computes the solution of the system to machine precision in $O(N \log N)$ ops:
  $\hat{b} = \mathrm{fft}(b)$ (transform r.h.s.)
  $\hat{x} = D^{-1}\hat{b}$ (solve diagonal system, $O(N)$)
  $x = \mathrm{ifft}(\hat{x})$ (anti-transform to get the sol. vector)
• Total cost: 2 FFTs, plus one element-by-element vector multiply (the reciprocals of the values of the diagonal of $D$ are precomputed).
• In order to precompute the diagonal values of $D$, we take any vector $z$ and compute $y = Az$, then transform $\hat{z} = \mathrm{fft}(z)$, $\hat{y} = \mathrm{fft}(y)$, and set $D_{jj} = \hat{y}_j/\hat{z}_j$.
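The three steps above can be sketched on a 1D periodic Poisson problem. This is an illustration under stated assumptions, not the authors' GPU code: the 3-point stencil, the small recursive radix-2 FFT written out for self-containment, the unit impulse chosen for $z$ (so that no $\hat{z}_j$ vanishes), and the handling of the constant mode ($D_{00} = 0$ for periodic b.c.'s, so the zero-mean solution is returned) are all choices made for the example.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def ifft(x):
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def apply_A(z, h):
    """Periodic 3-point Laplacian (assumed stencil)."""
    n = len(z)
    return [(z[(i - 1) % n] - 2 * z[i] + z[(i + 1) % n]) / h**2 for i in range(n)]


def solve_poisson_periodic(b, h):
    n = len(b)
    # Precompute D: probe A with a unit impulse z and take D_jj = y_hat_j / z_hat_j.
    z = [1.0] + [0.0] * (n - 1)
    z_hat = fft(z)
    y_hat = fft(apply_A(z, h))
    D = [y / zz for y, zz in zip(y_hat, z_hat)]
    b_hat = fft(b)                                             # transform r.h.s.
    x_hat = [0j] + [b_hat[j] / D[j] for j in range(1, n)]      # diagonal solve
    return [v.real for v in ifft(x_hat)]                       # anti-transform
```

For a right-hand side built as `b = apply_A(x_true, h)` with a zero-mean `x_true`, `solve_poisson_periodic(b, h)` should recover `x_true` up to round-off.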
FFT computing rates in GPGPU. GTX-580 and Titan Black
[Figure: FFT proc. rate [Gflops/sec] vs. Ncell(*) for meshes from 64x64x64 to 512x256x256, with curves for GTX 580 (SP, DP) and GTX Titan Black (SP, DP).]
• GTX 580: launch June 2011, arch: Fermi, USD 1,000, 772 cores, 1.5 Tflops SP.
• GTX Titan Black: launch Feb 2014, arch: Kepler, USD 1,500, 2880 cores, 5.12 Tflops SP.
(*) Ncell = 2*Nx*Ny*Nz because the FFT is complex.
Upper bound on computing rate for the NS solver w/FFT
• For instance, for a mesh of $N = 256^3 = 16$ Mcell the rate is 4.66 Gcell/s (SP).
• So one could perform one Poisson solution (two FFTs) in 4.2 ms (this doesn't take into account the computation of the RHS, which is negligible).
• The Poisson eq. is a large component of the computing time in the Fractional Step solver for Navier-Stokes (and the only one that is > O(N)), so this puts an upper limit on the computing rate achievable with our algorithm (based on FFT).
Solution of NS on embedded geometries: FVM/IBM1
• When solid bodies are included in the fluid, we have to deal with curved surfaces that in general do not fit the node planes.
• The first approach is to fit the body surface by a stair-case one (a.k.a. Lego surface).
• Convergence is degraded to 1st order.
• The stair-case acts like an artificial roughness, so it tends to give higher drag.
[Figure: stair-case approximation of a solid body immersed in the fluid.]
Solving Poisson with embedded bodies
• The FFT solver and GMG are very fast but have several restrictions: translation invariance, periodic boundary conditions.
• No holes (solid bodies) in the fluid domain.
• So they are not well suited for direct use on embedded geometries.
• One approach for the solution is the IOP (Iterated Orthogonal Projection) algorithm.
• It is based on iteratively solving the Poisson eq. on the whole domain (fluid+solid). Solving in the whole domain is fast, because algorithms like Geometric Multigrid or FFT can be used. Also, they run very efficiently on GPUs.
The IOP (Iterated Orthogonal Projection) method
The method is based on successively solving for the incompressibility condition (on the whole domain: solid+fluid) and imposing the boundary condition.
[Figure labels: on the whole domain (fluid+solid); violates impenetrability b.c.; satisfies impenetrability b.c.]
The IOP (Iterated Orthogonal Projection) method (cont.)
• Fixed point iteration: $w^{k+1} = \Pi_{bdy}\Pi_{div} w^k$.
• Projection on the space of divergence-free velocity fields, $u' = \Pi_{div}(u)$:
  $u' = u - \nabla P$, $\Delta P = \nabla \cdot u$.
• Projection on the space of velocity fields that satisfy the impenetrability boundary condition, $u'' = \Pi_{bdy}(u')$:
  $u'' = u_{bdy}$ in $\Omega_{bdy}$, $u'' = u'$ in $\Omega_{fluid}$.
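As a toy picture of this fixed point iteration, the sketch below alternates two orthogonal projections in the plane until the iterate lands on the intersection of the two constraint sets. It is only a geometric analogy: the two lines stand in for the divergence-free space and the boundary-condition space, and the function names, the angle and the starting point are all made up for the example.

```python
import math


def project_onto_line(w, point, direction):
    """Orthogonal projection of w onto the line through `point` with unit `direction`."""
    t = (w[0] - point[0]) * direction[0] + (w[1] - point[1]) * direction[1]
    return (point[0] + t * direction[0], point[1] + t * direction[1])


def iterate_projections(w, angle, n_iter):
    """Apply w <- P_b(P_a(w)) n_iter times; both lines pass through (1, 0)."""
    p = (1.0, 0.0)
    d_a = (1.0, 0.0)                            # stands in for Pi_div
    d_b = (math.cos(angle), math.sin(angle))    # stands in for Pi_bdy
    for _ in range(n_iter):
        w = project_onto_line(w, p, d_a)
        w = project_onto_line(w, p, d_b)
    return w


print(iterate_projections((3.0, 2.0), math.pi / 3, 60))
```

In this toy setting the error shrinks by a factor $\cos^2(\text{angle})$ per sweep, so convergence slows down as the two lines become nearly parallel.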
Accelerated Global Preconditioning (AGP) algorithm
• Storti et al. "A FFT preconditioning technique for the solution of incompressible flow on GPUs." Computers & Fluids 74 (2013): 44-57.
• It shares some features with the IOP scheme.
• Both solvers are based on the fact that an efficient preconditioning exists. The preconditioner consists in solving the Poisson problem on the global domain (fluid+solid, i.e. no "holes" in the fluid domain).
• IOP is a stationary method and its limit rate of convergence doesn't depend on refinement.
• However, it does depend (as O(AR)) on the maximum aspect ratio AR of the bodies inside the fluid, i.e. the method performs poorly when very slender bodies are present.
• Our proposed method AGP is a preconditioned Krylov space method.
• The condition number of the preconditioned system doesn't depend on refinement but deteriorates with the AR of immersed bodies.
• IOP iterates over both the velocity and pressure fields, whereas AGP iterates only on the pressure vector (which is better for implementation on GPUs).
MOC+BFECC
• The Method of Characteristics (MOC) is a Lagrangian method. The position of the node ("particle") is tracked backwards in time and the values at the previous time step are interpolated at that position. These projections introduce dissipation.
• BFECC (Back and Forth Error Compensation and Correction) can fix this.
• Assume we have a low order (dissipative) operator (may be MOC, full upwind, or any other) $\Phi^{t+\Delta t} = L(\Phi^t, u)$.
• BFECC allows the dissipation error to be eliminated:
  - Advance the state forward: $\Phi^{t+\Delta t,*} = L(\Phi^t, u)$.
  - Advance the state backwards: $\Phi^{t,*} = L(\Phi^{t+\Delta t,*}, -u)$.
MOC+BFECC (cont.)
[Figure: exact solution vs. solution with dissipation error.]
• If $L$ introduces some dissipative error $\epsilon$, then $\Phi^{t,*} \neq \Phi^t$; in fact $\Phi^{t,*} = \Phi^t + 2\epsilon$.
• So we can compensate for the error:
  $\Phi^{t+\Delta t} = L(\Phi^t, u) - \epsilon = \Phi^{t+\Delta t,*} - \tfrac{1}{2}(\Phi^{t,*} - \Phi^t)$ (2)
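A minimal sketch of the compensation in Eq. (2) for 1D periodic advection. The low order operator $L$ is taken here to be first-order upwind (one of the options the slides mention); the grid size, the Courant number `c = u*dt/h` and the function names are assumptions of the example.

```python
import math


def upwind_step(phi, c):
    """First-order upwind advection on a periodic grid; c = u*dt/h may be negative."""
    n = len(phi)
    if c >= 0:
        return [phi[i] - c * (phi[i] - phi[i - 1]) for i in range(n)]
    return [phi[i] - c * (phi[(i + 1) % n] - phi[i]) for i in range(n)]


def bfecc_step(phi, c):
    """One step of Eq. (2): forward, backward, then subtract half the mismatch."""
    fwd = upwind_step(phi, c)       # Phi^{t+dt,*} = L(Phi^t, u)
    back = upwind_step(fwd, -c)     # Phi^{t,*}    = L(Phi^{t+dt,*}, -u)
    return [f - 0.5 * (b - p) for f, b, p in zip(fwd, back, phi)]


def max_error(step, n=64, c=0.4, steps=64):
    """Max-norm error after advecting a sine wave `steps` time steps."""
    phi = [math.sin(2 * math.pi * i / n) for i in range(n)]
    for _ in range(steps):
        phi = step(phi, c)
    exact = [math.sin(2 * math.pi * (i - c * steps) / n) for i in range(n)]
    return max(abs(a - b) for a, b in zip(phi, exact))


print(max_error(upwind_step), max_error(bfecc_step))
```

On this smooth test profile the compensated step should show a clearly smaller error than plain upwind, at the price of two applications of $L$ per time step.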
MOC+BFECC (cont.)
Nbr of Cells       QUICK-SP   BFECC-SP   QUICK-DP   BFECC-DP
64 × 64 × 64         29.09      12.38      15.9       5.23
128 × 128 × 128      75.74      18.00      28.6       7.29
192 × 192 × 192      78.32      17.81      30.3       7.52
Cubic cavity. Computing rates for the whole NS solver (one step) in [Mcell/sec] obtained with the BFECC and QUICK algorithms on an NVIDIA GTX 580. 3 Poisson iterations were used.
Costarelli et al. "GPGPU implementation of the BFECC algorithm for pure advection equations." Cluster Computing 17(2) (2014): 243-254.
Analysis of performance
• Regarding the performance results shown above, the computing rate of QUICK is at most 4x that of BFECC. Since the critical CFL for QUICK is 0.5, BFECC is more efficient than QUICK whenever it is used with CFL > 2. The CFL used in our simulations is typically CFL ≈ 5 and thus, at this CFL, the BFECC version runs 2.5 times faster than the QUICK version.
• The effective computing rate is more representative than the plain computing rate:
  $\text{effective computing rate} = \mathrm{CFL}_{crit} \cdot \dfrac{\text{number of cells}}{\text{elapsed time}}$
• The speedup of MOC+BFECC versus QUICK increases with the number of Poisson iterations. In the limit of a very large number of iterations (very low tolerance for Poisson) we expect a speedup of 10x (equal to the CFL ratio).
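The effective rate can be checked against the table of computing rates with a few lines of arithmetic. The rates below are the 128 × 128 × 128 single precision entries (QUICK 75.74, BFECC 18.00 Mcell/s), with CFL = 0.5 for QUICK and CFL = 5 for BFECC as quoted above; the function name is an assumption.

```python
def effective_rate(rate_mcell_s, cfl):
    """Effective computing rate = CFL * (number of cells) / (elapsed time)."""
    return cfl * rate_mcell_s


quick = effective_rate(75.74, cfl=0.5)   # QUICK-SP, 128^3 mesh
bfecc = effective_rate(18.00, cfl=5.0)   # BFECC-SP, 128^3 mesh
speedup = bfecc / quick
print(quick, bfecc, speedup)
```

For this mesh the ratio comes out a little below the rounded "2.5 times" figure, which follows from taking the raw rate ratio as exactly 4x and the CFL ratio as 10x.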
Validation. A buoyant fully submerged sphere
• A fluid structure interaction problem has been used for validation.
• Fluid is water, $\rho_{fluid} = 1000\ \mathrm{kg/m^3}$.
• The body is a smooth sphere, D = 10 cm, fully submerged in a box of L = 39 cm. Solid density is $\rho_{bdy} = 366\ \mathrm{kg/m^3} \approx 0.4\,\rho_{fluid}$, so the body has positive buoyancy. The buoy (sphere) is tied by a string to an anchor point at the center of the bottom side of the box.
• The box is subject to horizontal harmonic oscillations $x_{box} = A_{box}\sin(\omega t)$.
Spring-mass-damper oscillator
• The acceleration of the box creates a driving force on the sphere. Note that as the sphere is positively buoyant the driving force is in phase with the displacement. That means that if the box accelerates to the right, then the sphere tends to move in the same direction.
• The buoyancy of the sphere produces a restoring force that tends to keep the string aligned with the vertical direction.
• The mass of the sphere plus the added mass of the liquid gives inertia.
• The drag of the fluid gives a damping effect.
• So the system can be approximated as a (nonlinear) spring-mass-damper oscillator:
  $\underbrace{m_{tot}\ddot{x}}_{\text{inertia}} + \underbrace{0.5\rho_{fl} C_d A |v| \dot{x}}_{\text{damp}} + \underbrace{(m_{flot} g/L)\,x}_{\text{spring}} = \underbrace{m_{flot} a_{box}}_{\text{driving frc}}$,
  $m_{tot} = m_{bdy} + m_{add}$, $m_{flot} = m_d - m_{bdy}$,
  $m_{add} = C_{madd} m_d$, $C_{madd} \approx 0.5$, $m_d = \rho_{fl} V_{bdy}$.
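As a worked check of these definitions, the sketch below evaluates the masses for the sphere described earlier (D = 10 cm, $\rho_{fluid}$ = 1000 kg/m³, $\rho_{bdy}$ = 366 kg/m³) and the undamped natural frequency $f_0 = \frac{1}{2\pi}\sqrt{m_{flot}\,g/(L\,m_{tot})}$ implied by the inertia and spring terms. Two assumptions are made here: $L$ in the spring term is read as the string length, and the value 0.2 m used for it is hypothetical, since the slides do not state it.

```python
import math


def buoy_natural_frequency(rho_fluid, rho_body, diameter, string_length,
                           c_madd=0.5, g=9.81):
    """Masses and undamped natural frequency of the tethered buoy model."""
    volume = math.pi * diameter**3 / 6        # sphere volume V_bdy
    m_d = rho_fluid * volume                  # displaced fluid mass
    m_bdy = rho_body * volume                 # body mass
    m_add = c_madd * m_d                      # added mass
    m_tot = m_bdy + m_add                     # inertia
    m_flot = m_d - m_bdy                      # flotation mass
    omega0 = math.sqrt(m_flot * g / (string_length * m_tot))
    return omega0 / (2 * math.pi), m_d, m_bdy, m_add


# string_length=0.2 is a hypothetical value chosen only for illustration.
f0, m_d, m_bdy, m_add = buoy_natural_frequency(1000.0, 366.0, 0.10, 0.2, c_madd=0.55)
print(f0, m_d, m_bdy, m_add)
```

The computed masses can be compared with the values quoted later in the slides ($m_{bdy}$ = 0.192 kg, $m_d$ = 0.523 kg, $m_{add}$ = 0.288 kg); the frequency depends on the assumed string length.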
Experimental results
• Experiments were carried out at Univ Santiago de Chile (Dra. Marcela Cruchaga).
• Plexiglas box mounted on a shake table, subject to oscillations of 2 cm amplitude at frequencies of 0.5-1.5 Hz.
• Sphere position is obtained using Motion Capture (MoCap) from a high speed (HS) camera (800x800 pixels @ 120 fps). The HS camera is an AOS Q-PRI; it can do 3 MPixel @ 500 fps, and its max frame rate is 100,000 fps.
• MoCap is performed with a home-made code. It finds the current position by minimization of a functional, and can also detect rotations. In this experimental setup 1 pixel = 0.5 mm (D = 200 px). Camera distortion and refraction have been evaluated to be negligible. MoCap can obtain subpixel resolution.
Experimental results (cont.)
[Figure: phase [deg] and amplitude [m] vs. freq [Hz]; experimental data (exp) compared with the analytical model, anly (CD=0.3, Cmadd=0.55).]
The best fit of the experimental results with the simple spring-mass-damper oscillator is obtained for Cmadd = 0.55 and Cd = 0.25. Reference values are:
• Cmadd = 0.5 (potential flow), Cmadd = 0.58 (Neill et al., Amer J Phys 2013),
• Cd = 0.47 (reference value).
[Figure: two diagrams, one for a buoy (positive buoyancy) and one for a pendulum (negative buoyancy), each showing $a_{box}$, the driving force, the box displacement and the buoy displacement for f << f0, f = f0 and f >> f0.]
$\text{driving force} = (m_d - m_{bdy})\,a_{box}$ (3)
Computational FSI model
• The fluid structure interaction is computed with a partitioned Fluid Structure Interaction (FSI) technique.
• The fluid is solved in the step $[t^n, t^{n+1} = t^n + \Delta t]$ between the known position $x^n$ of the body at $t^n$ and an extrapolated (predicted) position $x^{n+1,P}$ at $t^{n+1}$.
• Then the fluid forces (pressure and viscous tractions) are computed and the rigid body equations are solved to compute the true position of the solid $x^{n+1}$.
• The scheme is second-order accurate. It can be cast as a fixed point iteration and iterated to convergence (enhancing stability).
Computational FSI model (cont.)
• The simulation is performed in a non-inertial frame attached to the box. So, according to the Galilean equivalence principle, we just have to add an inertial force due to the acceleration of the box.
• Due to incompressibility this is absorbed into a horizontal flotation force on the sphere, which is the driving force for the system.
• The rigid body eqs. are now:
  $m_{bdy}\ddot{x} + (m_{flot} g/L)\,x = F_{fluid} + m_{flot} a_{box}$
  where $F_{fluid} = F_{drag} + F_{add}$ are now computed with the CFD FVM/IBM1 code.
Why use a positively buoyant body?
• Gov. equations for the body:
  $m_{bdy}\ddot{x} + (m_{flot} g/L)\,x = \underbrace{F_{fluid}(\{x\})}_{\text{comp w/CFD}} + m_{flot} a_{box}$
  where $F_{fluid} = F_{drag} + F_{add}$ are the fluid forces computed with the FVM/IBM code.
• The most important and hardest part, from the point of view of the numerical method, is to compute the fluid forces (drag and added mass).
• Buoyancy is almost trivial and can be easily captured or factored out from the simulation.
• If the body is too heavy it is barely affected by the fluid forces, so we prefer a body that is as light as possible.
• In the present example $m_{bdy} = 0.192$ kg, $m_d = 0.523$ kg, $m_{add} = 0.288$ kg.
Comparison of experimental and numerical results
[Figure: amplitude [m] and phase [deg] vs. freq [Hz], experimental vs. FVM/IBM1.]
• Results obtained with a mesh of $256^3 = 16.7$ Mcell.
• Maximum error is 22% at resonance (f = 0.9 Hz).
• Added mass and buoyancy cancel out at resonance, and the amplitude is almost entirely controlled by drag. Hence drag is not well computed.
• Numerical amplitude is lower than experimental ⟹ numerical drag is higher than experimental.
• Probable cause: the staircase representation of the surface acts as artificial roughness and increases drag.
• Resonance frequency seems correct ⟹ added mass and buoyancy forces are correct.
Convergence of FVM/IBM1
• Plot shows amplitude vs mesh size at resonance (f = 0.9 Hz).
• The scheme is $O(h)$ convergent.
• Need for an $O(h^2)$ scheme with computational efficiency comparable to the 1st order one (FVM/IBM1).
[Figure: Abuoy [m] vs. mesh size h [m] for FVM/IBM1 on meshes 96^3 = 0.9 Mcell, 128^3 = 2.1 Mcell, 192^3 = 7.1 Mcell and 256^3 = 16.7 Mcell, compared with the experimental value. Inset: actual Lego sphere with O(40^3) blocks.]
Immersed boundary second order scheme (FVM/IBM2)
• Fadlun, et al. "Combined immersed-boundary..." JCP (2000)
• Hard to combine with staggered grids and Fractional Step schemes.
• So we switched to Rhie-Chow interpolation in order to use co-located grids (i.e. avoiding staggered grids).
• We also switched to the SIMPLE algorithm for iterating the pressure-velocity coupling.
• Bodies are represented by level set functions. This allows for easy computation of boundary operators without branch divergence.
Immersed boundary second order scheme (FVM/IBM2) (cont.)
Data initialization: $u^0$, $p^0$, $q^0$, $\dot{q}^0$
for $n = 0, N_{step}$ do
    $u^* \leftarrow u^n$, $p^* \leftarrow p^n$
    for $k = 0, N_{simple}$ do
        $w \leftarrow u^*$, $f \leftarrow 0$
        for $l = 0, N_{IBM}$ do
            $w = Gp^* + R(w) + f$; $\bar{w} = \Pi_{BC}(w, q_S^n, \dot{q}_S^n)$
            $f' = \bar{w} - Gp^* - R(w)$, $\epsilon_f = \|f' - f\| / \|f'_{l=0}\|$
            $f = f'$, $w = \bar{w}$
            if $\epsilon_f < tol_f$ break
        end for
        $u \leftarrow \bar{w}$
        Solve $A(p - p^*) = A p^* + D u$ for $p$
        $\epsilon_u = \|u - u^*\| / \|u - u^n\|$
        $u^* \leftarrow \alpha_u u + (1 - \alpha_u) u^*$; $p^* \leftarrow p^* + \alpha_p p$
        if $\epsilon_u < tol_u$ break
    end for
    $u^{n+1} \leftarrow u^*$, $p^{n+1} \leftarrow p^*$
    Compute the tractions of the fluid onto the solid $t_{fl}$
    Compute the new solid position $q_S^{n+1}$ and velocity $\dot{q}_S^{n+1}$
end for
The boundary projection operator
• Given a solid cell node S we have to determine the fictitious (a.k.a. parasitic) velocity $u_S^f$ such that it guarantees the satisfaction of the boundary condition at B.
• The closest point $x_B$ to S is determined from $x_S = x_B - d\,\hat{n}_B$.
• $d$ and $\hat{n}_B$ can be determined from the level set function $\psi$:
  $\hat{n}_B = (\nabla\psi/\|\nabla\psi\|)_B \approx (\nabla\psi/\|\nabla\psi\|)_S$, $d = \psi_S$.
• The mirror fluid point to S is $x_F = x_B + d\,\hat{n}_B$.
• The fictitious velocity at S is obtained by extrapolation from the mirror point: $u_S^f = 2u_B - u_F$ (*)
• Forcing at S is iteratively found such that (*) is satisfied.
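[Figure: solid cell center S ($\psi < 0$, solid), boundary point B ($\psi = 0$) with normal $\hat{n}_B$, and mirror fluid cell center F ($\psi > 0$, fluid); S and F both lie at distance d from B.]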
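The geometric part of this construction can be sketched for a spherical body, whose level set is known in closed form. Everything specific here is an assumption for illustration: the signed-distance level set $\psi(x) = |x - c| - R$ (negative in the solid), taking $d$ as the positive depth $-\psi_S$ of S below the surface, and the function names.

```python
import math


def sphere_level_set(x, center, radius):
    """Signed distance to a sphere: negative in the solid, positive in the fluid."""
    return math.dist(x, center) - radius


def mirror_point(x_s, center, radius):
    """Boundary point B and mirror fluid point F for a solid node S (not at the center)."""
    r = math.dist(x_s, center)
    n = [(a - c) / r for a, c in zip(x_s, center)]      # unit normal, grad(psi)/|grad(psi)|
    d = -sphere_level_set(x_s, center, radius)          # positive depth of S (assumed sign)
    x_b = [a + d * ni for a, ni in zip(x_s, n)]         # closest surface point
    x_f = [a + 2 * d * ni for a, ni in zip(x_s, n)]     # mirror point, x_B + d*n
    return x_b, x_f, n, d


def fictitious_velocity(u_b, u_f):
    """Relation (*): u_S = 2*u_B - u_F, so that the mean of u_S and u_F equals u_B."""
    return [2 * ub - uf for ub, uf in zip(u_b, u_f)]


x_b, x_f, n, d = mirror_point((0.9, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)
print(x_b, x_f, d)
```

Because the normal and the distance come straight from evaluating $\psi$, every solid node near the surface runs the same arithmetic, which is consistent with the remark about avoiding branch divergence.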
Numerical results with FVM/IBM2
[Figure: amplitude Ab [m] and phase phi [deg] vs. f [Hz]; experimental, FVM/IBM1 (256^3 = 16.7 Mcell) and FVM/IBM2 (192^3 = 7.1 Mcell).]
Convergence of FVM/IBM2
• FVM/IBM2 exhibits $O(h^2)$ convergence.
[Figure: Abuoy [m] vs. mesh size h [m] for FVM/IBM1, FVM/IBM2 and experimental.]
Convergence of FVM/IBM2 (cont.)
• Results for FVM/IBM2 with $192^3$ are much better than FVM/IBM1 with a finer mesh $256^3$.
• Time step for FVM/IBM1 was chosen so that CFL = 0.2 (CFLcrit = 0.5, QUICK version).
• Tolerance for Poisson $10^{-6}$.
• Results are almost converged for $N_{IBM} = 3$, $N_{simple} = 5$.
• For FVM/IBM2 we used $tol_f = 1 \times 10^{-3}$, $tol_u = 2 \times 10^{-3}$ (typically $N_{IBM} = 1$, $N_{simple} = 3$). CFL = 0.1 ($\Delta t = 2 \times 10^{-4}$ s). (Remember CFLcrit = 0.5-0.6.)
• Results have been fitted with the analytic model giving Cd,num = 0.23, experimental = 0.25 (reference value for sphere drag is Cd = 0.47).
• Added mass coefficient: FVM/IBM2: Cmadd = 0.58, experimental = 0.58, reference: 0.55, potential theory: 0.5.
Computational efficiency of FVM/IBM2 on GPGPUs
• Efficiency of the above algorithm has been measured on Nvidia Tesla K40c and GTX Titan Black.
• K40c has 2880 CUDA cores @ 876 MHz, 12 GiB ECC RAM. GK110 model (Kepler arch). (It's equivalent to the K40 but has a built-in fansink.) Cost: O(USD 3100).
• GTX Titan Black has 2880 cores @ 980 MHz with 6 GiB RAM (no ECC). Cost: O(USD 1500).
Tesla K40c vs GTX Titan Black
                       Tesla K40         GTX Titan Black    Titan Black vs K40
cores / Turbo speed    2880 @ 875 MHz    2880 @ 980 MHz     +12%
dissipation            235 W             250 W              +7%
device RAM             12 GiB ECC        6 GiB (no ECC)     0.5x
# transistors          7.10e9            7.08e9             =
SP performance         4.29 Tflop/s      5.12 Tflop/s       +19%
memory bandwidth       288 GB/s          336 GB/s           +17%
cost (approx.)         USD 3,100         USD 1,500          0.5x
Computational efficiency of FVM/IBM2 on GPGPUs
[Figure: computing rate [Mcell/s] vs. Ncell for GTX Titan Black and K40c, in SP and DP, with NSIMPLE = 1 and NSIMPLE = 3.]
Computing rates for FVM/IBM2
• Maximum attained rate is 1.3 Gcell/s for GTX Titan Black in SP, $N_{simple} = 1$, mesh of $224^3 = 11.2$ Mcell.
• Computing rates increase monotonically with refinement in all cases.
• Comp. rates are roughly $\propto (N_{simple} N_{IBM})^{-1}$.
• Titan Black outperforms K40c for both SP and DP.
• Remarkably good performance of GTX Titan Black, especially in DP.
[Figure: computing rate [Mcell/s] vs. Ncell for GTX Titan Black and K40c, in SP and DP, with NSIMPLE = 1 and NSIMPLE = 3.]
Computing rates for FVM/IBM2 (cont.)
• For $192^3$ (the refinement used for the results presented above) we can compute 1 s of simulation in 30 s of computation (ST:RT = 1:30).
• We can estimate an upper bound for the computing rate by counting the transactions to the device RAM.
• We have 66 transactions (floats/doubles) per cell. So the theoretical limit is
  $\text{limit rate (SP)} = \dfrac{336\ \mathrm{GB/s}}{66 \cdot 4\ \mathrm{bytes/cell}} = 1.27\ \mathrm{Gcell/s}$,
  $\text{limit rate (DP)} = \dfrac{336\ \mathrm{GB/s}}{66 \cdot 8\ \mathrm{bytes/cell}} = 636\ \mathrm{Mcell/s}$. (4)
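The bound in Eq. (4) is simple enough to reproduce directly. The function name is an assumption; the inputs are the Titan Black bandwidth (336 GB/s) and the 66 transactions per cell quoted above.

```python
def bandwidth_limit_rate(bandwidth_bytes_s, transactions_per_cell, bytes_per_value):
    """Upper bound on cells/s if every transaction goes to device RAM."""
    return bandwidth_bytes_s / (transactions_per_cell * bytes_per_value)


limit_sp = bandwidth_limit_rate(336e9, 66, 4)   # single precision, 4 bytes per value
limit_dp = bandwidth_limit_rate(336e9, 66, 8)   # double precision, 8 bytes per value
print(limit_sp / 1e9, "Gcell/s", limit_dp / 1e6, "Mcell/s")
```

The same function can be reused with the K40 bandwidth (288 GB/s) to compare the two cards under the same transaction count.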
Time consumption breakdown
Numerical results
[Figures and videos: vorticity isosurface; velocity at symmetry plane; tadpoles; colored vorticity isosurfaces.]
FVM/IBM vs. LBM on GPGPU
• In LBM the Poisson equation is not solved; it uses pseudo-compressibility, hence LBM is a true Cellular Automata (CA) algorithm. (FVM/IBM could do so too, of course.)
• So comparatively it can reach higher computing rates, but it suffers from the artificial Mach penalization effect.
• Suppose you have a maximum velocity $u_{max}$ in the flow field.
• If you use pseudo-compressibility then you must choose an artificial speed of sound $c_{art} > u_{max}$.
• The compressibility effects are $O(M_{art}^2)$ ($M_{art} = u_{max}/c_{art}$), so you choose, let's say, $M_{art} = 0.1$ so that the error is $O(10^{-2})$:
  $\Delta t < \mathrm{CFL}_{crit} \cdot h/(u_{max} + c_{art}) = \mathrm{CFL}_{crit} \cdot \dfrac{M_{art}}{1 + M_{art}} \cdot h/u_{max} \approx \mathrm{CFL}_{crit} \cdot M_{art} \cdot h/u_{max}$,
  $\mathrm{CFL} < M_{art} \cdot \mathrm{CFL}_{crit}$.
• So your CFL is penalized by a factor $M_{art}$ (0.1 in the example).
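The penalization can be made concrete with the numbers from the example. The sketch below evaluates the largest flow-based CFL, $u_{max}\Delta t/h$, allowed when the time step is limited by $u_{max} + c_{art}$; the function name is an assumption, and $\mathrm{CFL}_{crit} = 0.5$ is used only as a representative value.

```python
def penalized_cfl(cfl_crit, mach_art):
    """Largest u_max*dt/h allowed when dt < cfl_crit * h / (u_max + c_art).

    With M_art = u_max / c_art this equals cfl_crit * M_art / (1 + M_art).
    """
    return cfl_crit * mach_art / (1.0 + mach_art)


exact = penalized_cfl(0.5, 0.1)   # about 0.045
approx = 0.5 * 0.1                # the M_art * CFL_crit approximation
print(exact, approx)
```

The exact bound is slightly stricter than the $M_{art} \cdot \mathrm{CFL}_{crit}$ approximation, so the quoted factor of 0.1 is, if anything, optimistic.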
FVM/IBM vs. LBM on GPGPU (cont.)
• So in fact not only Mcell/s must be reported, but also the critical CFL for
the algorithm.
• So if the computing rate in Mcell/s is doubled but the CFL is penalized by a
factor of 0.5, the algorithm only breaks even.
• Or rather, the effective computing rate should be reported:
\[
\text{effective computing rate} = \mathrm{CFL}_{\text{crit}} \cdot \frac{\text{number of cells}}{\text{elapsed time}}
\]
• The presented algorithm has an experimentally determined
$\mathrm{CFL}_{\text{crit}}$ of 0.5 to 0.6.
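The break-even argument can be made concrete with a small sketch of the effective computing rate. The two solvers and all their numbers below are hypothetical, chosen only so that one has twice the raw rate and half the critical CFL of the other.

```python
import math


def effective_rate(cfl_crit, n_cells, elapsed_s):
    """Effective computing rate: raw cells/s weighted by the critical CFL."""
    return cfl_crit * n_cells / elapsed_s


N_CELLS = 1_000_000  # hypothetical mesh size

# Solver A: 1 Gcell/s raw rate, CFL_crit = 0.5 (hypothetical)
rate_a = effective_rate(0.5, N_CELLS, 1.0e-3)
# Solver B: 2 Gcell/s raw rate, but CFL_crit = 0.25 (hypothetical)
rate_b = effective_rate(0.25, N_CELLS, 0.5e-3)

print(f"A: {rate_a / 1e6:.0f} effective Mcell/s")
print(f"B: {rate_b / 1e6:.0f} effective Mcell/s")
print("break even:", math.isclose(rate_a, rate_b))
```

Weighting by the critical CFL turns the raw rate into a measure of simulated time advanced per second of wall clock, which is what makes schemes with different stability limits comparable.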
Future work
• We are in the process of buying an HPC cluster (≈ 20 Tflops, w/o GPGPUs)
with funds from the Provincia de Santa Fe Science Agency ASACTEI (thanks!).
It includes nodes w/GTX Titan Z (2x2880 = 5760 cores @ 876 MHz
(boost), 12 GiB RAM, equivalent to 2 Titan (not Black!) on the same card).
We estimate reaching 2 Gcell/s on one card.
• Replace the momentum eqs. solver with MOC+BFECC.
• Multi-GPU (hybrid MPI+CUDA+OpenMP): some experiments already done
with the Poisson eq.
• Refinement with multi-block strategy.
• LES Smagorinsky turbulence model.
Acknowledgments
• Chilean Council for Scientific and Technological Research (CONICYT-FONDECYT 1130278);
• The Scientific Research Projects Management Department of the Vice Presidency of
Research, Development and Innovation (DICYT-VRID) at Universidad de Santiago de Chile;
• Association of Universities: Montevideo Group (AUGM);
• Argentinean Council for Scientific Research (CONICET project PIP 112-20111-00978);
• Argentinean National Agency for Technological and Scientific Promotion, ANPCyT (grants
PICT-E-2014-0191, PICT-2014-0191);
• Universidad Nacional del Litoral, Argentina (grant CAI+D-501-201101-00233-LI);
• Santa Fe Science Technology and Innovation Agency (grant ASACTEI-010-18-2014); and
• European Research Council (ERC) Advanced Grant Real Time Computational Mechanics
Techniques for Multi-Fluid Problems (REALTIME, Reference: ERC-2009-AdG).
• We made extensive use of Free Software such as the GNU/Linux OS, GCC/G++ compilers and Octave, and
Open Source software such as VTK and ParaView, among many others.
怊Queenslandęƕäøšę–‡å‡­-ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•ć€‹ć€ŠQueenslandęƕäøšę–‡å‡­-ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•ć€‹
怊Queenslandęƕäøšę–‡å‡­-ę˜†å£«å…°å¤§å­¦ęƕäøščÆęˆē»©å•ć€‹
Ā 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Ā 

Advances in the Solution of NS Eqs. in GPGPU Hardware. Second order scheme and experimental validation

  • 1. Advances in NS solver on GPU, by M. Storti et al.
      Advances in the Solution of NS Eqs. in GPGPU Hardware
      by Mario Storti (a,b), Santiago Costarelli (a,b), Luciano Garelli (a), Marcela Cruchaga (c), Ronald Ausensi (c), S. Idelsohn (a,b,d,e), Gilles Perrot (f), Raphaël Couturier (f)
      (a) Centro de Investigación de Métodos Computacionales, CIMEC (CONICET-UNL), Santa Fe, Argentina, mario.storti@gmail.com, www.cimec.org.ar/mstorti
      (b) Facultad de Ingeniería y Ciencias Hídricas, UN Litoral, Santa Fe, Argentina, fich.unl.edu.ar
      (c) Depto Ing. Mecánica, Univ Santiago de Chile
      (d) International Center for Numerical Methods in Engineering (CIMNE), Technical University of Catalonia (UPC), www.cimne.com
      (e) Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain, www.icrea.cat
      (f) FEMTO-ST Institute, Franche-Comté Meso-Centre
      CIMEC-CONICET-UNL ((version texstuff-1.2.8-4-g7cfd85f Mon Apr 27 23:01:25 2015 -0300) (date Thu Apr 30 12:04:10 2015 -0300))
  • 2. Scientific computing on GPUs
      • Graphics Processing Units (GPUs) are specialized hardware designed to offload computation from the CPU in graphics-intensive applications.
      • They have many cores (thread processors); currently the Tesla K40 (Kepler GK110) has 2880 cores at 745 MHz (boost 875 MHz).
      • The raw computing power is on the order of teraflops (4.3 Tflops in SP and 1.43 Tflops in DP).
      • Memory bandwidth (GDDR5): 288 GiB/s. Memory size: 12 GiB.
      • Cost: USD 3,100. Low-end version GeForce GTX Titan: USD 1,500.
  • 3. Scientific computing on GPUs (cont.)
      • The difference between the GPU architecture and standard multicore processors is that GPUs have many more computing units (ALUs, Arithmetic-Logic Units, and SFUs, Special Function Units), but few control units.
      • The programming model is SIMD (Single Instruction Multiple Data).
      • Branch divergence happens when a branch condition evaluates to true in some threads and false in others. Due to the SIMD model, the threads taking different branches can't be executed simultaneously.
        if (COND) { BODY-TRUE; } else { BODY-FALSE; }
      • GPUs compete with many-core processors (e.g. Intel's Xeon Phi, Knights Corner, 60 cores).
  • 4. GPUs, MIC, and multi-core processors
      • Computing power is limited by the number of transistors we can put on a chip. This is limited in turn by the largest silicon chip you can grow ($O(400\ \mathrm{mm}^2)$) and the minimum size of the transistors you can make ($O(22\ \mathrm{nm})$). So currently the limit is on the order of $5 \times 10^9$ transistors per chip.
      • With that number of transistors you can do one of the following:
          • Have a few (6 to 12) very powerful cores: Xeon E5-26XX processors.
          • Have 60 or more less powerful cores, each one with 40 million transistors, similar to a Pentium processor: Intel Many Integrated Core (MIC), Xeon Phi.
          • Have a very large number ($O(3000)$) of simple cores (without control units): Nvidia GPUs.
      [Figure: die shot of the Nvidia "Kepler" GK104 GPU]
  • 5. Solution of incompressible Navier-Stokes flows on GPU
      • GPUs are less efficient for algorithms that require access to the card's (device) global memory. In older cards shared memory is much faster but usually scarce (16K per thread block in the Tesla C1060).
      • The best algorithms are those that make computations for one cell requiring only information on that cell and its neighbors. These algorithms are classified as cellular automata (CA).
      • In newer cards there is a much larger memory cache. Even so, data locality is fundamental.
      • Lattice-Boltzmann and explicit F*M methods (FDM/FVM/FEM) fall in this category.
      • Structured meshes require less data to be exchanged between cells (e.g. neighbor indices are computed, not stored), and so they require less shared memory. Also, very fast solvers such as FFT-based (Fast Fourier Transform) or Geometric Multigrid solvers are available.
  • 6. Fractional Step Method on structured grids with QUICK
      Proposed by Molemaker et al., SCA'08: 2008 ACM SIGGRAPH, "Low viscosity flow simulations for animation".
      • Fractional Step Method (a.k.a. pressure segregation).
      • u, v, w and continuity cells are staggered (MAC = Marker And Cell).
      • The QUICK advection scheme is used in the predictor stage.
      • The Poisson system is solved with IOP (Iterated Orthogonal Projection) (to be described later), on top of Geometric MultiGrid.
      [Figure: staggered MAC grid with $U_{ij}$, $V_{ij}$, $P_{ij}$ around the point $(ih, jh)$, showing the x-momentum, y-momentum and continuity nodes and their cells]
  • 7. Solution of the Poisson with FFT
      • For large meshes, the solution of the Poisson equation is the most CPU-time-consuming stage in Fractional-Step-like Navier-Stokes solvers.
      • We have to solve a linear system $Ax = b$.
      • The Discrete Fourier Transform (DFT) is an orthogonal transformation $\hat{x} = Ox = \mathrm{fft}(x)$.
      • The inverse transformation $O^{-1} = O^T$ is the inverse Fourier Transform $x = O^T \hat{x} = \mathrm{ifft}(\hat{x})$.
      • If the operator matrix $A$ is spatially invariant (i.e. the stencil is the same at all grid points) and the b.c.'s are periodic, then it can be shown that $O$ diagonalizes $A$, i.e. $OAO^{-1} = D$.
      • So in the transformed basis the system of equations is diagonal:
        $(OAO^{-1})(Ox) = (Ob), \quad D\hat{x} = \hat{b}$.   (1)
      • For $N = 2^p$ the Fast Fourier Transform (FFT) is an algorithm that computes the DFT (and its inverse) in $O(N \log N)$ operations.
  • 8. Solution of the Poisson with FFT (cont.)
      • So the following algorithm computes the solution of the system to machine precision in $O(N \log N)$ ops:
        $\hat{b} = \mathrm{fft}(b)$   (transform r.h.s.)
        $\hat{x} = D^{-1}\hat{b}$   (solve diagonal system, $O(N)$)
        $x = \mathrm{ifft}(\hat{x})$   (anti-transform to get the solution vector)
      • Total cost: 2 FFTs, plus one element-by-element vector multiply (the reciprocals of the diagonal values of $D$ are precomputed).
      • In order to precompute the diagonal values of $D$, we take any vector $z$ and compute $y = Az$, then transform $\hat{z} = \mathrm{fft}(z)$, $\hat{y} = \mathrm{fft}(y)$, and set $D_{jj} = \hat{y}_j / \hat{z}_j$.
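The three-step solve on this slide can be sketched in a few lines. The following is a minimal illustration, not the authors' GPU code: it assumes a 1D periodic grid with the standard 3-point Laplacian, and it uses a naive $O(N^2)$ DFT from the standard library as a stand-in for a real FFT. The function names are hypothetical. Two choices here are assumptions added for the sketch: the probe vector $z$ is a unit impulse (so that no $\hat{z}_j$ vanishes), and the zero-frequency mode, for which the periodic Laplacian is singular, is set to zero so that the solution has zero mean.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        s = sum(x[j] * cmath.exp(sign * 2j * math.pi * j * k / n) for j in range(n))
        out.append(s / n if inverse else s)
    return out

def apply_laplacian(x, h):
    """Periodic 3-point stencil: (x[i-1] - 2 x[i] + x[i+1]) / h^2."""
    n = len(x)
    return [(x[i - 1] - 2.0 * x[i] + x[(i + 1) % n]) / h**2 for i in range(n)]

def precompute_diagonal(n, h):
    """D_jj = fft(A z)_j / fft(z)_j, using a unit impulse as probe vector z."""
    z = [1.0] + [0.0] * (n - 1)      # all Fourier coefficients of z equal 1
    y = apply_laplacian(z, h)
    zh, yh = dft(z), dft(y)
    return [yh[j] / zh[j] for j in range(n)]

def solve_poisson(b, diag):
    """Solve A x = b: transform, divide by the diagonal, anti-transform."""
    bh = dft(b)
    # Mode 0 is skipped (D_00 = 0 for periodic b.c.): the solution mean is set to zero.
    xh = [0.0 if j == 0 else bh[j] / diag[j] for j in range(len(b))]
    return [v.real for v in dft(xh, inverse=True)]
```

In a production solver the diagonal would be computed once and its reciprocals stored, so each solve costs two transforms and one element-wise multiply, as stated on the slide.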
  • 9. FFT computing rates in GPGPU. GTX-580 and Titan Black
      [Figure: FFT processing rate [Gflops] vs. Ncell for meshes from 64x64x64 up to 512x256x256; series: GTX 580 DP, GTX 580 SP, GTX Titan Black DP, GTX Titan Black SP. Ncell = 2*Nx*Ny*Nz because the FFT is complex.]
      • GTX 580: launch June 2011, arch: Fermi, USD 1,000, 772 cores, 1.5 Tflops SP.
      • GTX Titan Black: launch Feb 2014, arch: Kepler, USD 1,500, 2880 cores, 5.12 Tflops SP.
  • 10. Upper bound on computing rate for the NS solver w/FFT
      • For instance, for a mesh of $N = 256^3 \approx 16.7$ Mcell the rate is 4.66 Gcell/s (SP).
      • So one could perform one Poisson solution (two FFTs) in 4.2 ms (this doesn't take into account the computation of the RHS, which is negligible).
      • The Poisson eq. is a large component of the computing time in the Fractional Step solver for Navier-Stokes (and the only one that is $> O(N)$), so this puts an upper limit on the computing rate achievable with our (FFT-based) algorithm.
  • 11. Solution of NS on embedded geometries: FVM/IBM1
      • When solid bodies are included in the fluid, we have to deal with curved surfaces that in general do not fit the node planes.
      • The first approach is to fit the body surface by a staircase one (a.k.a. Lego surface).
      • Convergence is degraded to 1st order.
      • The staircase acts like an artificial roughness, so it tends to give higher drag.
      [Figure: staircase approximation of a solid body immersed in the fluid]
  • 12. Solving Poisson with embedded bodies
      • The FFT solver and GMG are very fast but have several restrictions: translation invariance, periodic boundary conditions.
      • No holes (solid bodies) in the fluid domain.
      • So they are not well suited for direct use on embedded geometries.
      • One approach for the solution is the IOP (Iterated Orthogonal Projection) algorithm.
      • It is based on iteratively solving the Poisson eq. on the whole domain (fluid+solid). Solving on the whole domain is fast, because algorithms like Geometric Multigrid or FFT can be used. Also, they run very efficiently on GPUs.
  • 13. The IOP (Iterated Orthogonal Projection) method
      The method is based on successively solving for the incompressibility condition (on the whole domain: solid+fluid) and imposing the boundary condition.
      [Figure: alternating projections; the divergence-free field computed on the whole domain (fluid+solid) violates the impenetrability b.c., and the next projection satisfies the impenetrability b.c.]
  • 14. The IOP (Iterated Orthogonal Projection) method (cont.)
      • Fixed point iteration: $w^{k+1} = \Pi_{bdy}\Pi_{div} w^k$.
      • Projection on the space of divergence-free velocity fields, $u' = \Pi_{div}(u)$:
        $u' = u - \nabla P, \quad \Delta P = \nabla \cdot u$.
      • Projection on the space of velocity fields that satisfy the impenetrability boundary condition, $u'' = \Pi_{bdy}(u')$:
        $u'' = u_{bdy}$ in $\Omega_{bdy}$, $\quad u'' = u'$ in $\Omega_{fluid}$.
  • 15. Accelerated Global Preconditioning (AGP) algorithm
      • Storti et al. "A FFT preconditioning technique for the solution of incompressible flow on GPUs." Computers & Fluids 74 (2013): 44-57.
      • It shares some features with the IOP scheme.
      • Both solvers are based on the fact that an efficient preconditioning exists. The preconditioner consists in solving the Poisson problem on the global domain (fluid+solid, i.e. no "holes" in the fluid domain).
      • IOP is a stationary method and its limit rate of convergence doesn't depend on refinement.
      • However, it does depend on $O(AR)$, where AR is the maximum aspect ratio of bodies inside the fluid. I.e. the method performs badly when very slender bodies are present.
      • Our proposed method AGP is a preconditioned Krylov space method.
      • The condition number of the preconditioned system doesn't depend on refinement but deteriorates with the AR of immersed bodies.
      • IOP iterates over both the velocity and pressure fields, whereas AGP iterates only on the pressure vector (which is better for implementation on GPUs).
  • 16. MOC+BFECC
      • The Method of Characteristics (MOC) is a Lagrangian method. The position of the node ("particle") is tracked backwards in time and the values at the previous time step are interpolated at that position. These projections introduce dissipation.
      • BFECC (Back and Forth Error Compensation and Correction) can fix this.
      • Assume we have a low order (dissipative) operator (it may be MOC, full upwind, or any other): $\Phi^{t+\Delta t} = L(\Phi^t, u)$.
      • BFECC makes it possible to eliminate the dissipation error:
          • Advance the state forward: $\Phi^{t+\Delta t,*} = L(\Phi^t, u)$.
          • Advance the state backwards: $\Phi^{t,*} = L(\Phi^{t+\Delta t,*}, -u)$.
  • 17. MOC+BFECC (cont.)
      [Figure: exact solution vs. solution with dissipation error]
      • If $L$ introduces some dissipative error $\epsilon$, then $\Phi^{t,*} \neq \Phi^t$; in fact $\Phi^{t,*} = \Phi^t + 2\epsilon$.
      • So we can compensate for the error:
        $\Phi^{t+\Delta t} = L(\Phi^t, u) - \epsilon = \Phi^{t+\Delta t,*} - \tfrac{1}{2}(\Phi^{t,*} - \Phi^t)$.   (2)
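As a rough illustration of Eq. (2), the sketch below applies the forward/backward compensation to 1D periodic advection. It assumes that the low order operator $L$ is a first-order upwind step (one of the options the slides mention) rather than MOC, and that the velocity is constant; the function names and the Courant-number parameter `c` are hypothetical and not taken from the authors' code.

```python
import math

def upwind_step(phi, c):
    """Low order operator L: first-order upwind on a periodic grid.
    c = u*dt/h is the signed Courant number."""
    n = len(phi)
    if c >= 0:
        return [phi[i] - c * (phi[i] - phi[i - 1]) for i in range(n)]
    return [phi[i] - c * (phi[(i + 1) % n] - phi[i]) for i in range(n)]

def bfecc_step(phi, c):
    """One BFECC step following Eq. (2)."""
    forward = upwind_step(phi, c)         # Phi^{t+dt,*} = L(Phi^t, u)
    backward = upwind_step(forward, -c)   # Phi^{t,*}    = L(Phi^{t+dt,*}, -u)
    # Subtract half of the round-trip error (Phi^{t,*} - Phi^t) = 2*eps.
    return [f - 0.5 * (b - p) for f, b, p in zip(forward, backward, phi)]

def advect(phi, c, steps, step_fn):
    """Apply step_fn repeatedly."""
    for _ in range(steps):
        phi = step_fn(phi, c)
    return phi

def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))
```

Advecting a sine wave once around the periodic domain with both operators and comparing against the initial profile gives a quick check that the compensated step is noticeably less dissipative than the plain upwind step.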
  • 18. MOC+BFECC (cont.)
      Nbr of cells       QUICK-SP   BFECC-SP   QUICK-DP   BFECC-DP
      64 × 64 × 64       29.09      12.38      15.9       5.23
      128 × 128 × 128    75.74      18.00      28.6       7.29
      192 × 192 × 192    78.32      17.81      30.3       7.52
      Cubic cavity. Computing rates for the whole NS solver (one step) in [Mcell/sec] obtained with the BFECC and QUICK algorithms on an NVIDIA GTX 580. 3 Poisson iterations were used.
      Costarelli et al. "GPGPU implementation of the BFECC algorithm for pure advection equations." Cluster Computing 17(2) (2014): 243-254.
  • 19. Analysis of performance
      • Regarding the performance results shown above, the computing rate of QUICK is at most 4x faster than that of BFECC. Since the critical CFL for QUICK is 0.5, BFECC is more efficient than QUICK whenever it is used with CFL > 2. The CFL used in our simulations is typically CFL ≈ 5 and, thus, at this CFL the BFECC version runs 2.5 times faster than the QUICK version.
      • The effective computing rate is more representative than the plain computing rate:
        $\text{effective computing rate} = \mathrm{CFL}_{crit} \cdot \dfrac{\text{number of cells}}{\text{elapsed time}}$
      • The speedup of MOC+BFECC versus QUICK increases with the number of Poisson iterations. In the limit of a very large number of iterations (very low tolerance for Poisson) we expect a speedup of 10x (equal to the CFL ratio).
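The effective-rate argument can be reproduced with a short calculation. The sketch below is only illustrative: it plugs in the single-precision rates reported in the deck for the 192³ mesh (78.32 Mcell/s for QUICK, 17.81 Mcell/s for BFECC) together with the CFL values quoted on this slide (0.5 and 5); the function and variable names are hypothetical.

```python
def effective_rate(rate_mcell_per_s, cfl):
    """Effective computing rate = CFL * (number of cells) / (elapsed time)."""
    return cfl * rate_mcell_per_s

# Plain computing rates in Mcell/s (192^3 mesh, SP, GTX 580), as reported in the deck.
QUICK_RATE = 78.32
BFECC_RATE = 17.81

quick_eff = effective_rate(QUICK_RATE, 0.5)   # QUICK runs at its critical CFL
bfecc_eff = effective_rate(BFECC_RATE, 5.0)   # typical CFL used with BFECC

speedup = bfecc_eff / quick_eff
# CFL above which BFECC beats QUICK in effective rate.
break_even_cfl = 0.5 * QUICK_RATE / BFECC_RATE
```

With these inputs the speedup comes out at about 2.3 and the break-even CFL at about 2.2, in line with the rounded figures (2.5x and CFL > 2) quoted on the slide.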
  • 20. Validation. A buoyant fully submerged sphere
      • A fluid structure interaction problem has been used for validation.
      • The fluid is water, $\rho_{fluid} = 1000\ \mathrm{kg/m^3}$.
      • The body is a smooth sphere, $D = 10$ cm, fully submerged in a box of $L = 39$ cm. The solid density is $\rho_{bdy} = 366\ \mathrm{kg/m^3} \approx 0.4\,\rho_{fluid}$, so the body has positive buoyancy. The buoy (sphere) is tied by a string to an anchor point at the center of the bottom side of the box.
      • The box is subject to horizontal harmonic oscillations $x_{box} = A_{box}\sin(\omega t)$.
  • 21. Spring-mass-damper oscillator
      • The acceleration of the box creates a driving force on the sphere. Note that as the sphere is positively buoyant the driving force is in phase with the displacement. That means that if the box accelerates to the right, then the sphere tends to move in the same direction. (launch video p2-f09)
      • The buoyancy of the sphere produces a restoring force that tends to keep the string aligned with the vertical direction.
      • The mass of the sphere plus the added mass of the liquid gives inertia.
      • The drag of the fluid gives a damping effect.
      • So the system can be approximated as a (nonlinear) spring-mass-damper oscillator:
        $\underbrace{m_{tot}\,\ddot{x}}_{\text{inertia}} + \underbrace{0.5\,\rho_{fl} C_d A |v|\,\dot{x}}_{\text{damping}} + \underbrace{(m_{flot}\,g/L)\,x}_{\text{spring}} = \underbrace{m_{flot}\,a_{box}}_{\text{driving force}}$,
        $m_{tot} = m_{bdy} + m_{add}$, $\quad m_{flot} = m_d - m_{bdy}$, $\quad m_{add} = C_{madd}\,m_d$, $\quad C_{madd} \approx 0.5$, $\quad m_d = \rho_{fl} V_{bdy}$.
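The oscillator model above is easy to integrate numerically. The following sketch is an illustration under stated assumptions, not the fitting code used for the slides: the masses are the values quoted elsewhere in the deck ($m_{bdy} = 0.192$ kg, $m_d = 0.523$ kg, $m_{add} = 0.288$ kg), the drag coefficient and box amplitude are the quoted 0.25 and 2 cm, the velocity $v$ in the drag term is taken to be $\dot{x}$ in the box frame, and the string length `length = 0.21` m is a hypothetical placeholder because the deck does not state it. All function and key names are likewise hypothetical.

```python
import math

def make_params(freq_hz=0.9, cd=0.25, a_box=0.02, length=0.21):
    """Model parameters; 'length' (string length, m) is an assumed placeholder."""
    m_bdy, m_d, m_add = 0.192, 0.523, 0.288        # kg, values quoted in the deck
    return {
        "m_tot": m_bdy + m_add, "m_flot": m_d - m_bdy,
        "rho": 1000.0, "Cd": cd, "area": math.pi * 0.1**2 / 4.0,  # D = 10 cm
        "g": 9.81, "length": length,
        "A_box": a_box, "omega": 2.0 * math.pi * freq_hz,
    }

def natural_frequency(p):
    """Undamped natural frequency in Hz: sqrt(k / m_tot) / (2 pi), k = m_flot g / L."""
    k = p["m_flot"] * p["g"] / p["length"]
    return math.sqrt(k / p["m_tot"]) / (2.0 * math.pi)

def acceleration(t, x, v, p):
    """x'' from m_tot x'' + 0.5 rho Cd A |v| v + (m_flot g / L) x = m_flot a_box."""
    a_box = -p["A_box"] * p["omega"]**2 * math.sin(p["omega"] * t)  # d2/dt2 of A sin(wt)
    drag = 0.5 * p["rho"] * p["Cd"] * p["area"] * abs(v) * v
    spring = p["m_flot"] * p["g"] / p["length"] * x
    return (p["m_flot"] * a_box - drag - spring) / p["m_tot"]

def rk4_step(t, x, v, dt, p):
    """One classical Runge-Kutta step for the state (x, v)."""
    k1x, k1v = v, acceleration(t, x, v, p)
    k2x = v + 0.5 * dt * k1v
    k2v = acceleration(t + 0.5 * dt, x + 0.5 * dt * k1x, k2x, p)
    k3x = v + 0.5 * dt * k2v
    k3v = acceleration(t + 0.5 * dt, x + 0.5 * dt * k2x, k3x, p)
    k4x = v + dt * k3v
    k4v = acceleration(t + dt, x + dt * k3x, k4x, p)
    return (x + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0,
            v + dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0)
```

Sweeping `freq_hz` and recording the steady-state amplitude of `x` would give an amplitude curve of the same kind as the analytic one shown in the deck, but the numbers depend directly on the assumed string length.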
  • 22. Experimental results
      • Experiments were carried out at Univ Santiago de Chile (Dra. Marcela Cruchaga).
      • Plexiglas box mounted on a shake table, subject to oscillations of 2 cm amplitude at frequencies of 0.5-1.5 Hz.
      • The sphere position is obtained using Motion Capture (MoCap) from a high speed (HS) camera (800x800 pixels @ 120 fps). The HS camera is an AOS Q-PRI; it can do 3 MPixel @ 500 fps, and its maximum frame rate is 100,000 fps.
      • MoCap is performed with a home-made code. It finds the current position by minimization of a functional, and can also detect rotations. In this experimental setup 1 pixel = 0.5 mm (D = 200 px). Camera distortion and refraction have been evaluated to be negligible. MoCap can obtain subpixel resolution. (launch video mocap2)
  • 23. Experimental results (cont.)
      [Figure: phase [deg] vs. freq [Hz] and amplitude [m] vs. freq [Hz]; experimental data (exp) against the analytic model, legend "anly(CD=0.3,Cmadd=0.55)"]
      The best fit of the experimental results with the simple spring-mass-damper oscillator is obtained for $C_{madd} = 0.55$ and $C_d = 0.25$. Reference values are:
      • $C_{madd} = 0.5$ (potential flow), $C_{madd} = 0.58$ (Neill et al., Amer J Phys 2013),
      • $C_d = 0.47$ (reference value). (launch video description-buoy-resoviz)
  • 24. [Figure: two panels, buoy (positive buoyancy) and pendulum (negative buoyancy), each showing $a_{box}$, the driving force, the box displacement, and the buoy displacement for $f \ll f_0$, $f = f_0$ and $f \gg f_0$]
      $\text{driving force} = (m_d - m_{bdy})\,a_{box}$   (3)
  • 25. Computational FSI model
      • The fluid structure interaction is performed with a partitioned Fluid Structure Interaction (FSI) technique.
      • The fluid is solved in the step $[t^n, t^{n+1} = t^n + \Delta t]$ between the known position $x^n$ of the body at $t^n$ and an extrapolated (predicted) position $x^{n+1,P}$ at $t^{n+1}$.
      • Then the fluid forces (pressure and viscous tractions) are computed and the rigid body equations are solved to compute the true position of the solid, $x^{n+1}$.
      • The scheme is second-order accurate. It can be put as a fixed point iteration and iterated to convergence (enhancing stability).
  • 26. Computational FSI model (cont.)
      • The simulation is performed in a non-inertial frame attached to the box. So, according to the Galilean equivalence principle, we just have to add an inertial force due to the acceleration of the box.
      • Due to the incompressibility, this is absorbed in a horizontal flotation force on the sphere, which is the driving force for the system.
      • The rigid body eqs. are now:
        $m_{bdy}\,\ddot{x} + (m_{flot}\,g/L)\,x = F_{fluid} + m_{flot}\,a_{box}$,
        where $F_{fluid} = F_{drag} + F_{add}$ are now computed with the CFD FVM/IBM1 code.
  • 27. Why use a positive buoyant body?
      • Governing equations for the body:
        $m_{bdy}\,\ddot{x} + (m_{flot}\,g/L)\,x = \underbrace{F_{fluid}(\{x\})}_{\text{comp. w/CFD}} + m_{flot}\,a_{box}$,
        where $F_{fluid} = F_{drag} + F_{add}$ are fluid forces computed with the FVM/IBM code.
      • The most important and hardest part, from the point of view of the numerical method, is to compute the fluid forces (drag and added mass).
      • Buoyancy is almost trivial and can easily be captured well or factored out from the simulation.
      • Then, if the body is too heavy it is not affected by the fluid forces, so we prefer a body as light as possible.
      • In the present example $m_{bdy} = 0.192$ kg, $m_d = 0.523$ kg, $m_{add} = 0.288$ kg.
  • 28. Comparison of experimental and numerical results
      [Figure: amplitude [m] vs. freq [Hz] and phase [deg] vs. freq [Hz]; experimental vs. FVM/IBM1]
  • 29. Comparison of experimental and numerical results
      [Figure: amplitude [m] vs. freq [Hz] and phase [deg] vs. freq [Hz]; experimental vs. FVM/IBM1]
      • Results obtained with a mesh of $256^3 = 16.7$ Mcell.
      • Maximum error 22% at resonance ($f = 0.9$ Hz).
      • Added mass and buoyancy cancel out at resonance, and the amplitude is almost entirely controlled by drag. Hence drag is not well computed.
      • Numerical amplitude is lower than experimental ⟹ numerical drag is higher than experimental.
      • Probable cause: the staircase representation of the surface acts as artificial roughness and increases drag.
      • Resonance frequency seems correct ⟹ added mass and buoyancy forces are correct.
  • 30. Convergence of FVM/IBM1
      • The plot shows amplitude vs. mesh size at resonance ($f = 0.9$ Hz).
      • The scheme is $O(h)$ convergent.
      • There is a need for an $O(h^2)$ scheme with computational efficiency comparable to the 1st order one (FVM/IBM1).
      [Figure: buoy amplitude $A_{buoy}$ [m] vs. mesh size $h$ [m] for FVM/IBM1 against the experimental value; meshes $256^3 = 16.7$ Mcell, $192^3 = 7.1$ Mcell, $128^3 = 2.1$ Mcell, $96^3 = 0.9$ Mcell. Inset: actual Lego sphere with $O(40^3)$ blocks.]
  • 31. Immersed boundary second order scheme (FVM/IBM2)
      • Fadlun et al. "Combined immersed-boundary..." JCP (2000).
      • Hard to combine with staggered grids and Fractional Step schemes.
      • So we switched to Rhie-Chow interpolation in order to use co-located grids (i.e. avoiding staggered grids).
      • Switch to the SIMPLE algorithm for iterating the pressure-velocity coupling.
      • Bodies are represented by level set functions. This allows for easy computation of boundary operators without branch divergence.
  • 32. Immersed boundary second order scheme (FVM/IBM2) (cont.)
      Data initialization: $u^0, p^0, q^0, \dot{q}^0$
      for $n = 0, N_{step}$ do
          $u^* \leftarrow u^n$, $p^* \leftarrow p^n$
          for $k = 0, N_{simple}$ do
              $w \leftarrow u^*$, $f \leftarrow 0$
              for $l = 0, N_{IBM}$ do
                  $w' = Gp^* + R(w) + f$; $\bar{w} = \Pi_{BC}(w', q^n_S, \dot{q}^n_S)$
                  $f' = \bar{w} - Gp^* - R(w)$, $\epsilon_f = \|f' - f\| / \|f_{l=0}\|$
                  $f = f'$, $w = \bar{w}$
                  if $\epsilon_f < tol_f$ break
              end for
              $u \leftarrow \bar{w}$
              Solve $A(p - p^*) = A p^* + Du$ for $p$
              $\epsilon_u = \|u - u^*\| / \|u - u^n\|$
              $u^* \leftarrow \alpha_u u + (1 - \alpha_u) u^*$; $p^* \leftarrow p^* + \alpha_p p$
              if $\epsilon_u < tol_u$ break
          end for
          $u^{n+1} \leftarrow u^*$, $p^{n+1} \leftarrow p^*$
          Compute the tractions of the fluid onto the solid, $t_{fl}$
          Compute the new solid position $q^{n+1}_S$ and velocity $\dot{q}^{n+1}_S$
      end for
  • 33. The boundary projection operator
      • Given a solid cell node $S$, we have to determine the fictitious (a.k.a. parasitic) velocity $u^f_S$ such that it guarantees the satisfaction of the boundary condition at $B$.
      • The closest boundary point $x_B$ to $S$ is determined from $x_S = x_B - d\,n_B$.
      • $d$ and $n_B$ can be determined from the level set function $\psi$:
        $n_B = (\nabla\psi / \|\nabla\psi\|)_B \approx (\nabla\psi / \|\nabla\psi\|)_S, \quad d = |\psi_S|$.
      • The mirror fluid point to $S$ is $x_F = x_B + d\,n_B$.
      • The velocity at the solid node is determined by extrapolation from the mirror point: $u^f_S = 2u_B - u_F$   (*)
      • The forcing at $S$ is iteratively found such that (*) is satisfied.
      [Figure: solid cell center $S$ ($\psi < 0$, solid), boundary point $B$ ($\psi = 0$) with unit normal $\hat{n}_B$, and mirror point $F$ near the fluid cell centers ($\psi > 0$, fluid); $S$ and $F$ both lie at distance $d$ from $B$]
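The geometric part of this projection (closest boundary point, mirror point, fictitious velocity) can be sketched as follows. This is a minimal illustration under assumptions that are not in the slides: the level set $\psi$ is taken to be a signed distance (negative in the solid, positive in the fluid), the body is a sphere, the normal is evaluated at $S$ by central differences, and the velocity at the mirror point is simply passed in rather than interpolated from the grid. All names are hypothetical.

```python
import math

def sphere_level_set(center, radius):
    """Return psi(x): signed distance, negative inside the solid sphere."""
    def psi(x):
        return math.dist(x, center) - radius
    return psi

def unit_normal(psi, x, eps=1e-6):
    """n = grad(psi)/|grad(psi)| by central differences (points toward the fluid)."""
    grad = []
    for k in range(len(x)):
        xp, xm = list(x), list(x)
        xp[k] += eps
        xm[k] -= eps
        grad.append((psi(xp) - psi(xm)) / (2.0 * eps))
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / norm for g in grad]

def boundary_and_mirror_points(psi, x_s):
    """Closest boundary point x_B = x_S + d n and mirror fluid point x_F = x_B + d n."""
    d = abs(psi(x_s))                 # distance from the solid node to the surface
    n = unit_normal(psi, x_s)         # normal at S, used as an approximation of n_B
    x_b = [xs + d * nk for xs, nk in zip(x_s, n)]
    x_f = [xb + d * nk for xb, nk in zip(x_b, n)]
    return x_b, x_f

def fictitious_velocity(u_b, u_f):
    """u_S = 2 u_B - u_F, so that linear interpolation between S and F gives u_B at B."""
    return [2.0 * ub - uf for ub, uf in zip(u_b, u_f)]
```

Because everything is computed from $\psi$ with the same arithmetic at every solid node, no per-node branching on the body geometry is needed, which fits the remark in the deck that level sets avoid branch divergence.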
  • 34. Numerical results with FVM/IBM2
      [Figure: buoy amplitude $A_b$ [m] vs. $f$ [Hz] and phase $\phi$ [deg] vs. $f$ [Hz]; experimental, FVM/IBM1 ($256^3 = 16.7$ Mcell) and FVM/IBM2 ($192^3 = 7.1$ Mcell)]
  • 35. Convergence of FVM/IBM2
      • FVM/IBM2 exhibits $O(h^2)$ convergence.
      [Figure: buoy amplitude $A_{buoy}$ [m] vs. mesh size $h$ [m]; FVM/IBM1, FVM/IBM2 and experimental]
  • 36. Convergence of FVM/IBM2 (cont.)
      • Results for FVM/IBM2 with $192^3$ are much better than those for FVM/IBM1 with a finer mesh, $256^3$.
      • The time step for FVM/IBM1 was chosen so that CFL = 0.2 ($\mathrm{CFL}_{crit} = 0.5$, QUICK version).
      • Tolerance for Poisson: $10^{-6}$.
      • Results are almost converged for $N_{IBM} = 3$, $N_{simple} = 5$.
      • For FVM/IBM2 we used $tol_f = 1 \times 10^{-3}$, $tol_u = 2 \times 10^{-3}$ (typically $N_{IBM} = 1$, $N_{simple} = 3$), CFL = 0.1 ($\Delta t = 2 \times 10^{-4}$ s). (Remember $\mathrm{CFL}_{crit} = 0.5$ to $0.6$.)
      • Results have been fitted with the analytic model, giving $C_{d,num} = 0.23$, experimental = 0.25 (the reference value for sphere drag is $C_d = 0.47$).
      • Added mass coefficient: FVM/IBM2: $C_{madd} = 0.58$, experimental = 0.58, reference: 0.55, potential theory: 0.5.
  • 37. Computational efficiency of FVM/IBM2 on GPGPUs
      • The efficiency of the above algorithm has been measured on an Nvidia Tesla K40c and a GTX Titan Black.
      • The K40c has 2880 CUDA cores @ 875 MHz and 12 GiB ECC RAM. GK110 model (Kepler arch). (It is equivalent to the K40 but has a built-in fansink.) Cost: O(USD 3100).
      • The GTX Titan Black has 2880 cores @ 980 MHz with 6 GiB RAM (no ECC). Cost: O(USD 1500).
  • 38. Tesla K40c vs GTX Titan Black
                             Tesla K40            GTX Titan Black      Titan Black vs K40
      cores / Turbo speed    2880 @ 875 MHz       2880 @ 980 MHz       +12%
      dissipation            235 W                250 W                +7%
      device RAM             12 GiB ECC           6 GiB (no ECC)       0.5x
      # transistors          7.10e9               7.08e9               =
      SP performance         4.29 Tflop/s         5.12 Tflop/s         +19%
      memory bandwidth       288 GB/s             336 GB/s             +17%
      cost (approx.)         USD 3,100            USD 1,500            0.5x
  • 39. Computational efficiency of FVM/IBM2 on GPGPUs
      [Figure: computing rate [Mcell/s] vs. Ncell for GTX Titan Black and K40c, in SP and DP, with NSIMPLE = 1 and NSIMPLE = 3]
  • 40. Computing rates for FVM/IBM2
      • The maximum attained rate is 1.3 Gcell/s for the GTX Titan Black in SP, $N_{simple} = 1$, mesh of $224^3 = 11.2$ Mcell.
      • Computing rates are monotonically increasing with refinement in all cases.
      • Computing rates are roughly $\propto (N_{simple} N_{IBM})^{-1}$.
      • The Titan Black outperforms the K40c for both SP and DP.
      • Remarkably good performance of the GTX Titan Black, especially in DP.
      [Figure: computing rate [Mcell/s] vs. Ncell for GTX Titan Black and K40c, in SP and DP, with NSIMPLE = 1 and NSIMPLE = 3]
  • 41. Computing rates for FVM/IBM2 (cont.)
      • For $192^3$ cells (the refinement used for the results presented above) we can compute 1 s of simulation in 30 s of computation (ST:RT = 1:30).
      • We can estimate an upper bound for the computing rate by counting the transactions to the device RAM.
      • We have 66 transactions (floats/doubles) per cell, so the theoretical limit is
        $$\text{limit rate (SP)} = \frac{336\ \text{GB/s}}{66 \cdot 4\ \text{bytes/cell}} = 1.27\ \text{Gcell/s}, \qquad \text{limit rate (DP)} = \frac{336\ \text{GB/s}}{66 \cdot 8\ \text{bytes/cell}} = 636\ \text{Mcell/s}. \tag{4}$$
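The bandwidth bound in Eq. (4) is plain arithmetic and can be checked with a few lines of Python. This is only a sketch: the function name `bandwidth_limited_rate` is hypothetical, and it assumes that GB/s means 1e9 bytes/s and that all 66 transactions per cell hit device RAM, as the slide states.

```python
def bandwidth_limited_rate(bandwidth_bytes_per_s, transactions_per_cell, bytes_per_value):
    """Upper bound on cells processed per second when device RAM traffic is the bottleneck."""
    return bandwidth_bytes_per_s / (transactions_per_cell * bytes_per_value)


# Values taken from the slides: GTX Titan Black bandwidth and 66 transactions per cell.
BANDWIDTH = 336e9          # bytes/s (assuming 1 GB = 1e9 bytes)
TRANSACTIONS_PER_CELL = 66

rate_sp = bandwidth_limited_rate(BANDWIDTH, TRANSACTIONS_PER_CELL, 4)  # 4-byte floats
rate_dp = bandwidth_limited_rate(BANDWIDTH, TRANSACTIONS_PER_CELL, 8)  # 8-byte doubles

print(f"SP limit: {rate_sp / 1e9:.2f} Gcell/s")
print(f"DP limit: {rate_dp / 1e6:.0f} Mcell/s")
```

Passing 288e9 as the bandwidth instead gives the corresponding bound for the K40c.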
  • 42. Time consumption breakdown
  • 43. Numerical results
      [Figures: vorticity isosurface; velocity at symmetry plane; tadpoles; colored vorticity isosurfaces.]
      Videos: fsi-boya-gpu-vorticity, gpu-corte, fsi-vort-3D, particles-buoyp5.
  • 44. FVM/IBM vs. LBM on GPGPU
      • In LBM the Poisson equation is not solved; it uses pseudo-compressibility, hence LBM is a true Cellular Automaton (CA) algorithm. (FVM/IBM could do so too, of course.)
      • So comparatively it can reach higher computing rates, but it suffers from the artificial Mach penalization effect.
      • Suppose you have a maximum velocity $u_{max}$ in the flow field.
      • If you use pseudo-compressibility then you must choose an artificial speed of sound $c_{art} > u_{max}$.
      • The compressibility effects are $O(M_{art}^2)$, with $M_{art} = u_{max}/c_{art}$, so you choose, let's say, $M_{art} = 0.1$ so that the error is $O(10^{-2})$. The time step is then limited by
        $$\Delta t < CFL_{crit} \cdot \frac{h}{u_{max} + c_{art}} = CFL_{crit} \cdot \frac{M_{art}}{1 + M_{art}} \cdot \frac{h}{u_{max}} \approx CFL_{crit} \cdot M_{art} \cdot \frac{h}{u_{max}},$$
        that is, in terms of the convective CFL number $CFL = u_{max} \Delta t / h$,
        $$CFL < M_{art} \cdot CFL_{crit}.$$
      • So your CFL is penalized by a factor $M_{art}$ (0.1 in the example).
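The artificial Mach penalization above can be illustrated numerically. The sketch below is not taken from the authors' code: the function names are hypothetical, and the values for `h`, `u_max` and `cfl_crit` are illustrative assumptions (only `M_art = 0.1` comes from the slide). It compares the time step allowed with pseudo-compressibility against the purely convective limit.

```python
def dt_pseudo_compressible(cfl_crit, h, u_max, mach_art):
    """Time step limit when the artificial sound speed c_art = u_max / M_art enters the CFL condition."""
    c_art = u_max / mach_art
    return cfl_crit * h / (u_max + c_art)


def dt_convective(cfl_crit, h, u_max):
    """Time step limit when only the flow velocity enters the CFL condition."""
    return cfl_crit * h / u_max


# Illustrative (assumed) values; only M_art = 0.1 comes from the slide.
cfl_crit = 0.5
h = 1.0 / 192
u_max = 1.0
mach_art = 0.1

penalty = dt_pseudo_compressible(cfl_crit, h, u_max, mach_art) / dt_convective(cfl_crit, h, u_max)
print(f"exact penalty  M/(1+M) = {penalty:.4f}")
print(f"approx penalty M       = {mach_art:.4f}")
```

The exact factor is $M_{art}/(1 + M_{art})$, which is slightly smaller than the approximate factor $M_{art}$ quoted on the slide.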
  • 45. FVM/IBM vs. LBM on GPGPU (cont.)
      • So in fact not only the rate in Mcell/s must be reported, but also the critical CFL of the algorithm.
      • If the computing rate in Mcell/s is doubled but the CFL is penalized by a factor of 0.5x, the algorithm only breaks even.
      • Or rather, the effective computing rate should be reported:
        $$\text{effective computing rate} = CFL_{crit} \cdot \frac{\text{number of cells}}{\text{elapsed time}}$$
      • The presented algorithm has an (experimentally determined) $CFL_{crit}$ = 0.5 to 0.6.
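The break-even argument can be made concrete with a short sketch of the effective computing rate. The two solvers and their numbers below are hypothetical, chosen only so that one has twice the raw rate and half the critical CFL of the other; the function name `effective_rate` is likewise an assumption.

```python
def effective_rate(cfl_crit, n_cells, elapsed_s):
    """Effective computing rate: raw cell rate (cells / elapsed time) weighted by the critical CFL."""
    return cfl_crit * n_cells / elapsed_s


# Hypothetical solver A: 1000 Mcell/s raw rate, critical CFL 0.5.
rate_a = effective_rate(cfl_crit=0.5, n_cells=1_000_000_000, elapsed_s=1.0)

# Hypothetical solver B: twice the raw rate, but half the critical CFL.
rate_b = effective_rate(cfl_crit=0.25, n_cells=2_000_000_000, elapsed_s=1.0)

print(f"A: {rate_a / 1e6:.0f} effective Mcell/s")
print(f"B: {rate_b / 1e6:.0f} effective Mcell/s")
```

Both solvers advance the simulated time at the same pace per wall-clock second, which is why the raw Mcell/s figure alone is not a fair basis for comparison.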
  • 46. Future work
      • We are in the process of buying an HPC cluster (≈ 20 Tflops, w/o GPGPUs) with funds from the Provincia de Santa Fe Science Agency ASACTEI (Thanks!). It includes nodes with a GTX Titan Z (2x2880 = 5760 cores @ 876 MHz (boost), 12 GiB RAM, equivalent to 2 Titans (not Black!) on the same card). We estimate reaching 2 Gcell/s on one card.
      • Replace the momentum equations solver with MOC+BFECC.
      • Multi-GPU (hybrid MPI+CUDA+OpenMP). Some experiments have already been done with the Poisson equation.
      • Refinement with a multi-block strategy.
      • LES Smagorinsky turbulence model.
  • 47. Acknowledgments
      • Chilean Council for Scientific and Technological Research (CONICYT-FONDECYT 1130278);
      • The Scientific Research Projects Management Department of the Vice Presidency of Research, Development and Innovation (DICYT-VRID) at Universidad de Santiago de Chile;
      • Association of Universities: Montevideo Group (AUGM);
      • Argentinean Council for Scientific Research (CONICET project PIP 112-20111-00978);
      • Argentinean National Agency for Technological and Scientific Promotion, ANPCyT (grants PICT-E-2014-0191, PICT-2014-0191);
      • Universidad Nacional del Litoral, Argentina (grant CAI+D-501-201101-00233-LI);
      • Santa Fe Science Technology and Innovation Agency (grant ASACTEI-010-18-2014); and
      • European Research Council (ERC) Advanced Grant Real Time Computational Mechanics Techniques for Multi-Fluid Problems (REALTIME, Reference: ERC-2009-AdG).
      • We made extensive use of Free Software such as the GNU/Linux OS, the GCC/G++ compilers and Octave, and Open Source software such as VTK and ParaView, among many others.