Advances in the Solution of NS Eqs. in GPGPU 
Hardware. Modelling Fluid Structure Interaction 
for a Submerged Spherical Buoy 
by Mario Storti(a,b), Santiago Costarelli(a,b), Luciano Garelli(a),
Marcela Cruchaga(c), Ronald Ausensi(c), S. Idelsohn(a,b,d,e)
(a) Centro de Investigación de Métodos Computacionales, CIMEC (CONICET-UNL),
Santa Fe, Argentina, mario.storti@gmail.com, www.cimec.org.ar/mstorti
(b) Facultad de Ingeniería y Ciencias Hídricas, UN Litoral, Santa Fe, Argentina, fich.unl.edu.ar
(c) Depto. Ing. Mecánica, Univ. Santiago de Chile
(d) International Center for Numerical Methods in Engineering (CIMNE),
Technical University of Catalonia (UPC), www.cimne.com
(e) Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain, www.icrea.cat
Scientific computing on GPUs
- Graphics Processing Units (GPUs) are specialized hardware designed to
  offload computation from the CPU in graphics-intensive applications.
- They have many cores (thread processors); currently the Tesla K40
  (Kepler GK110) has 2880 cores at 745 MHz (built-in boost to 810/875 MHz).
- The raw computing power is in the order of Teraflops
  (4.3 Tflops in SP and 1.43 Tflops in DP).
- Memory bandwidth (GDDR5): 288 GB/sec. Memory size: 12 GB.
- Cost: USD 5,000. Low-end version, GeForce GTX Titan: USD 1,000.
Scientific computing on GPUs (cont.)
- The difference between the GPU architecture and standard multicore
  processors is that GPUs have many more computing units (ALUs
  (Arithmetic-Logic Units) and SFUs (Special Function Units)), but few
  control units.
- The programming model is SIMD (Single Instruction Multiple Data).
- GPUs compete with many-core processors (e.g. Intel's Xeon Phi
  Knights Corner, with 60 cores).
// Under SIMD, all threads of a warp traverse both branches of a conditional;
// threads for which COND is false are masked out during BODY-TRUE (and vice
// versa), so divergent branches pay the cost of both code paths.
if (COND) {
  BODY-TRUE;
} else {
  BODY-FALSE;
}
Xeon Phi
- In December 2012 Intel launched the Xeon Phi coprocessor cards 3100 and
  5110P (USD 2,000 to 2,600). They have 60 cores in 22 nm technology
  (clock speed approx. 1 GHz). "Supercomputer on a card" (SOC).
- The limitation today (with 22 nm technology) is that about 5e9 transistors
  can be put on a single chip. Today's Xeon processors typically have 2.5e9
  transistors.
- The Xeon Phi has 60 cores, each roughly equivalent to the original Pentium
  processor (40e6 transistors).
Xeon Phi (cont.)
- The Xeon Phi is practically a computer of its own. It fits in a PCI Express
  x16 slot and runs its own basic Linux system. You can SSH into the card and
  run x86-64 code. Another workflow is to run the code on the host and offload
  compute-intensive tasks to the card (e.g. solving a linear system).
- In January 2013 the Texas Advanced Computing Center (TACC) added Xeon Phis
  to its Stampede supercomputer. The main CPUs are Xeon E5-2680; 128 nodes
  have Nvidia Kepler K20 GPUs. Estimated performance: 7.6 Pflops.
- Tianhe-2 (China), the current fastest supercomputer (33.86 Pflops), also
  includes Xeon Phi coprocessors.
- It is part of Intel's Many Integrated Core (MIC) architecture. Previous
  codenames for the project: Larrabee, Knights Ferry, Knights Corner.
We want to use GPUs... for what?
- Aerodynamics of a racing car
- Aerodynamics of an artillery projectile
- Flow pattern in a harbour
- Dust particles in a coke processing facility
- Thermo-fluid coupling
- Aerodynamics of a falling object
- Flutter at Ma=2.3
(launch video showall)
GPUs in HPC
- Some HPC people are skeptical about the effective computing power of GPUs
  for scientific applications.
- In many works the speedup is quoted with respect to the available CPU
  processors, which is not a consistent comparison.
- The delivered speedup w.r.t. mainstream x86 processors is often much lower
  than expected.
- Strict data parallelism is difficult to achieve in CFD applications.
- Unfortunately, this perception is reinforced by the fact that GPUs come from
  the videogame/special-effects industry, not from scientific computing.
Solution of incompressible Navier-Stokes flows on GPU
- GPUs are less efficient for algorithms that require access to the card's
  (device) global memory. Shared memory is much faster but usually scarce
  (16 KB per thread block in the Tesla C1060).
- The best algorithms are those that compute on one cell using only
  information from that cell and its neighbors. Such algorithms are
  classified as cellular automata (CA).
- Lattice-Boltzmann and explicit FxM (FDM/FVM/FEM) schemes fall in this
  category.
- Structured meshes require less data exchange between cells (e.g. neighbor
  indices are computed, not stored), and so they require less shared memory.
  Also, very fast solvers like FFT-based (Fast Fourier Transform) or Geometric
  Multigrid solvers are available. (A minimal indexing sketch is shown below.)
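As an illustration of "computed, not stored" neighbor indices, here is a
minimal C++ sketch; the lexicographic ordering, the periodic wrap and the Grid
struct itself are assumptions for illustration, not the solver's actual data
layout.

    // Neighbor indices on a structured nx*ny*nz grid are computed on the fly
    // from (i,j,k), so no connectivity table has to be stored or transferred.
    #include <cstddef>

    struct Grid {
      int nx, ny, nz;
      // Flat (linear) index of cell (i,j,k), lexicographic ordering (assumed).
      std::size_t idx(int i, int j, int k) const {
        return (std::size_t(k) * ny + j) * nx + i;
      }
      // West/east neighbors in x, with periodic wrap (as used with FFT solvers).
      std::size_t west(int i, int j, int k) const { return idx((i - 1 + nx) % nx, j, k); }
      std::size_t east(int i, int j, int k) const { return idx((i + 1) % nx, j, k); }
    };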
Fractional Step Method on structured grids with QUICK
Proposed by Molemaker et al., SCA'08: 2008 ACM SIGGRAPH, "Low viscosity flow
simulations for animation".
- Fractional Step Method (a.k.a. pressure segregation).
- The u, v, w and continuity cells are staggered (MAC = Marker And Cell).
- The QUICK advection scheme is used in the predictor stage.
- The Poisson system is solved with IOP (Iterated Orthogonal Projection),
  to be described later, on top of Geometric Multigrid.
[Figure: staggered MAC grid, with x-momentum (U_ij), y-momentum (V_ij) and
continuity/pressure (P_ij) nodes at (ih, jh), and the corresponding
x-momentum, y-momentum and continuity cells.]
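For reference, a minimal sketch of the pressure-segregation steps in the usual
notation (the slide only names the method; the exact discrete operators used
in the solver are not reproduced here):

    u^* = u^n + \Delta t [ -(u^n \cdot \nabla) u^n + \nu \nabla^2 u^n ]    (predictor; QUICK for the advective term)
    \nabla^2 p^{n+1} = (\rho / \Delta t) \nabla \cdot u^*                  (Poisson step; solved with IOP/GMG/FFT)
    u^{n+1} = u^* - (\Delta t / \rho) \nabla p^{n+1}                       (corrector; projection onto divergence-free fields)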
Solution of the Poisson equation with FFT
- The solution of the Poisson equation is, for large meshes, the most
  time-consuming stage in Fractional-Step-like Navier-Stokes solvers.
- We have to solve a linear system A x = b.
- The Discrete Fourier Transform (DFT) is an orthogonal transformation
  \hat{x} = O x = fft(x).
- The inverse transformation O^{-1} = O^T is the inverse Fourier transform,
  x = O^T \hat{x} = ifft(\hat{x}).
- If the operator matrix A is spatially invariant (i.e. the stencil is the same
  at all grid points) and the b.c.'s are periodic, then it can be shown that O
  diagonalizes A, i.e. O A O^{-1} = D.
- So in the transformed basis the system of equations is diagonal:
  (O A O^{-1}) (O x) = (O b),   i.e.   D \hat{x} = \hat{b}.   (1)
- For N = 2^p the Fast Fourier Transform (FFT) is an algorithm that
  computes the DFT (and its inverse) in O(N log(N)) operations.
Solution of the Poisson equation with FFT (cont.)
- So the following algorithm computes the solution of the system in
  O(N log(N)) ops:
  . \hat{b} = fft(b) (transform the r.h.s.),
  . \hat{x} = D^{-1} \hat{b} (solve the diagonal system, O(N)),
  . x = ifft(\hat{x}) (anti-transform to get the solution vector).
- Total cost: 2 FFTs plus one element-by-element vector multiply (the
  reciprocals of the values of the diagonal of D are precomputed).
- In order to precompute the diagonal values of D:
  . we take any vector z and compute y = A z,
  . then transform \hat{z} = fft(z), \hat{y} = fft(y),
  . D_{jj} = \hat{y}_j / \hat{z}_j.
(A self-contained sketch of this procedure is given below.)
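Below is a self-contained sketch of this procedure for a 1D periodic Poisson
problem. A plain O(N^2) DFT stands in for CUFFT/FFTW; the unit-impulse choice
of z, the grid size and the handling of the singular zero mode are
illustrative assumptions, not the production code.

    // "Precompute D, then solve diagonally in Fourier space" for the 1D
    // periodic Laplacian. A naive DFT replaces the FFT library for clarity.
    #include <algorithm>
    #include <cmath>
    #include <complex>
    #include <cstdio>
    #include <vector>

    using cplx = std::complex<double>;
    static const double PI = std::acos(-1.0);

    // A x: periodic 3-point Laplacian, (x_{i-1} - 2 x_i + x_{i+1}) / h^2.
    std::vector<double> applyA(const std::vector<double>& x, double h) {
      int N = (int)x.size();
      std::vector<double> Ax(N);
      for (int i = 0; i < N; ++i)
        Ax[i] = (x[(i - 1 + N) % N] - 2.0 * x[i] + x[(i + 1) % N]) / (h * h);
      return Ax;
    }

    std::vector<cplx> dft(const std::vector<double>& x) {   // stand-in for fft()
      int N = (int)x.size();
      std::vector<cplx> X(N);
      for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
          X[k] += x[j] * std::exp(cplx(0.0, -2.0 * PI * k * j / N));
      return X;
    }

    std::vector<double> idft(const std::vector<cplx>& X) {  // stand-in for ifft()
      int N = (int)X.size();
      std::vector<double> x(N);
      for (int j = 0; j < N; ++j) {
        cplx s(0.0, 0.0);
        for (int k = 0; k < N; ++k)
          s += X[k] * std::exp(cplx(0.0, 2.0 * PI * k * j / N));
        x[j] = s.real() / N;
      }
      return x;
    }

    int main() {
      const int N = 64;
      const double h = 1.0 / N;

      // Precompute D: z = unit impulse (all its Fourier modes are nonzero),
      // y = A z, D_kk = y_hat_k / z_hat_k.
      std::vector<double> z(N, 0.0); z[0] = 1.0;
      std::vector<cplx> zh = dft(z), yh = dft(applyA(z, h));
      std::vector<cplx> D(N);
      for (int k = 0; k < N; ++k) D[k] = yh[k] / zh[k];

      // Manufactured problem: x_exact = sin(2*pi*x), b = A x_exact.
      std::vector<double> xex(N);
      for (int i = 0; i < N; ++i) xex[i] = std::sin(2.0 * PI * i * h);
      std::vector<double> b = applyA(xex, h);

      // Solve: b_hat = fft(b); x_hat = D^{-1} b_hat; x = ifft(x_hat).
      std::vector<cplx> bh = dft(b), xh(N);
      for (int k = 0; k < N; ++k)
        xh[k] = (k == 0) ? cplx(0.0, 0.0) : bh[k] / D[k];   // zero mode is singular (periodic BC)
      std::vector<double> x = idft(xh);

      double err = 0.0;
      for (int i = 0; i < N; ++i) err = std::max(err, std::fabs(x[i] - xex[i]));
      std::printf("max error = %g\n", err);
      return 0;
    }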
Solution of the Poisson equation on embedded geometries
- FFT solvers and GMG are very fast but have several restrictions: translational
  invariance, periodic boundary conditions. They are not well suited for
  embedded geometries.
- One approach is the IOP (Iterated Orthogonal Projection) algorithm.
- It is based on iteratively solving the Poisson eq. on the whole domain
  (fluid+solid). Solving in the whole domain is fast, because algorithms like
  Geometric Multigrid or FFT can be used. Also, they are very efficient
  running on GPUs.
- However, if we solve in the whole domain, then we can't enforce the boundary
  condition (\partial p / \partial n) = 0 at the solid boundary, which means
  violating the impenetrability condition there.
The IOP (Iterated Orthogonal Projection) method
The method is based on successively solving for the incompressibility condition
(on the whole domain: solid+fluid) and then imposing the boundary condition.
[Figure: the divergence-free projection on the whole domain (fluid+solid)
violates the impenetrability b.c.; the corrected field satisfies the
impenetrability b.c.]
The IOP (Iterated Orthogonal Projection) method (cont.)
- Fixed point iteration:
  w^{k+1} = \Pi_{bdy}(\Pi_{div}(w^k)).
- Projection on the space of divergence-free velocity fields, u' = \Pi_{div}(u):
  u' = u - \nabla P,   with   \Delta P = \nabla \cdot u.
- Projection on the space of velocity fields that satisfy the impenetrability
  boundary condition, u'' = \Pi_{bdy}(u'):
  u'' = u_{bdy} in \Omega_{bdy},   u'' = u' in \Omega_{fluid}.
Implementation details on the GPU
- We use the CUFFT library.
- Per iteration: 2 FFTs and one Poisson residual evaluation. The FFT on the
  GPU Tesla C1060 performs at 27 Gflops (in double precision), where the
  operations are counted as 5 N log2(N).
[Figure: FFT computing rate [Gflops] vs. vector size (10^3 to 10^8) on the
Tesla C1060.]
FFT computing rates in GPGPU. GTX-580
[Figure: FFT processing rate [Gflops/sec] vs. Ncell on the GTX-580, in single
and double precision, for meshes from 64x64x64 up to 256x256x128.]
FFTW on i7-3820@3.60Ghz (Sandy Bridge)
[Figure: FFTW processing rate [Mflops] vs. Ncell (10^3 to 10^9) for
nthreads = 1, 2, 4.]
NSFVM Computing rates in GPGPU. Scaling
[Figure: NSFVM computing rate [Mcell/sec] vs. number of cells [Mcell] on the
C2050 and GTX-580, in single and double precision, for meshes from 64x64x64
up to 256x256x256.]
NSFVM and “Real Time” computing
- For a 128x128x128 mesh (2 Mcell), we have a computing time of
  2 Mcell / (140 Mcell/sec) = 0.014 secs per time step.
- That means 70 steps/sec.
- A von Neumann stability analysis shows that the QUICK scheme is
  unconditionally unstable if advanced in time with Forward Euler.
- With a second-order Adams-Bashforth scheme the critical CFL is 0.588.
- For the NS eqs. the critical CFL has been found to be somewhat lower (~0.5).
- If L = 1, u = 1, h = 1/128, then dt = 0.5 h/u = 0.004 [sec], so in 1 sec of
  wall-clock time we can compute 0.28 secs of simulation time. We say
  ST/RT = 0.28.
- On the GTX Titan Black we expect to run 3.5x faster, and it has double the
  memory (6 GB versus 3 GB). We expect that it would allow runs with up to
  12 Mcells (currently 6 Mcells on the GTX 580).
(launch video nsfvm-bodies-all)
NSFVM and “Real Time” computing (cont.) 
Computing times in GPGPU. Fractional Step components
[Figure: share of the total computing time [%] of each Fractional Step
component (predictor, comp-div, CG-FFT solver, corrector), for 128x128x128
and 256x256x128 meshes on the GTX 580 (single precision).]
Current work
Current work proceeds in the following directions:
- Improving performance by replacing the QUICK advection scheme with
  MOC+BFECC (which could be more GPU-friendly).
- Implementing a CPU-based renormalization algorithm for free surface
  (level set) flows.
- Improving the representation (accuracy) of the solid body surface by using
  an immersed boundary technique.
Why leave QUICK?
- One of the steps of the Fractional Step Method is the advection step. We
  have to advect the velocity field, and we want a method that is as little
  diffusive as possible and that allows as large a CFL number as possible.
- Also, of course, we want a GPU-friendly algorithm.
- Previously we used QUICK, but it has a stencil that extends more than one
  cell in the upwind direction. This increases shared memory usage and data
  transfer. We seek another low-dissipation scheme with a more compact
  stencil.
QUICK advection scheme
1D scalar advection-diffusion: a = advection velocity, \phi = advected scalar.

  \partial(a\phi)/\partial x |_{i+1/2} \approx [ (a \phi^Q)_{i+1} - (a \phi^Q)_i ] / \Delta x,

  \phi^Q_i = 3/8 \phi_{i+1/2} + 6/8 \phi_{i-1/2} - 1/8 \phi_{i-3/2},   if a >= 0,
  \phi^Q_i = 3/8 \phi_{i-1/2} + 6/8 \phi_{i+1/2} - 1/8 \phi_{i+3/2},   if a <= 0.

[Figure: 1D control-volume cell of width \Delta x, with cell-centered values at
i-3/2, i-1/2, i+1/2, i+3/2, i+5/2 and face values \phi^Q at faces i, i+1.]
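A minimal 1D sketch of the face value \phi^Q above; here phi[j] is assumed to
store the cell-centered value \phi_{j+1/2}, and the periodic wrap is only for
illustration.

    // QUICK face value phi^Q at face i (upwind-biased quadratic interpolation).
    // Convention assumed here: phi[j] holds the cell-centered value phi_{j+1/2}.
    #include <vector>

    double quick_face(const std::vector<double>& phi, int i, double a) {
      const int N = (int)phi.size();
      auto at = [&](int j) { return phi[((j % N) + N) % N]; };  // periodic access
      if (a >= 0.0)
        // 3/8 phi_{i+1/2} + 6/8 phi_{i-1/2} - 1/8 phi_{i-3/2}
        return 0.375 * at(i) + 0.75 * at(i - 1) - 0.125 * at(i - 2);
      // 3/8 phi_{i-1/2} + 6/8 phi_{i+1/2} - 1/8 phi_{i+3/2}
      return 0.375 * at(i - 1) + 0.75 * at(i) - 0.125 * at(i + 1);
    }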
Method Of Characteristics (MOC)
- The Method Of Characteristics (MOC) consists in tracking the node back
  along the characteristic to the position it had at time t^n and taking its
  value there:
  \phi(x^{n+1}, t^{n+1}) = \phi(x^n, t^n).
  If x^n does not happen to be a mesh node, this involves a projection
  (interpolation) onto the grid. (A minimal 1D sketch is given below.)
- It is the basis of the Lagrangian methods for dealing with advection terms.
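As referenced above, a minimal 1D sketch of a MOC (semi-Lagrangian) step on a
periodic grid, assuming a constant advection velocity and linear interpolation
at the foot of the characteristic; names and details are illustrative, not the
talk's code.

    // One MOC step: phi^{n+1}(x_i) = phi^n(x_i - a*dt); if the departure point
    // is not a grid node, the value is obtained by linear interpolation.
    #include <cmath>
    #include <vector>

    std::vector<double> moc_step(const std::vector<double>& phi,
                                 double a, double dt, double h) {
      const int N = (int)phi.size();
      std::vector<double> out(N);
      for (int i = 0; i < N; ++i) {
        double xd = i - a * dt / h;                  // departure point, in grid units
        double f = std::floor(xd);
        double w = xd - f;                           // interpolation weight in [0,1)
        int j0 = (((int)f) % N + N) % N;             // periodic wrap
        int j1 = (j0 + 1) % N;
        out[i] = (1.0 - w) * phi[j0] + w * phi[j1];  // exact when the CFL is an integer (w = 0)
      }
      return out;
    }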
Method Of Characteristics (MOC) (cont.)
- So typically MOC has very low diffusion if the CFL is an integer number,
  and is too diffusive if it is a half-integer number.
- Of course, in the general case (non-uniform meshes, non-uniform velocity
  field) we cannot manage to have an integer CFL number for all the nodes.
(launch video video-moc-cfl1), (launch video video-moc-cfl05).
[Figure: max(u) at t=10 after advecting the profile u(t=0) around a periodic
domain (L=1, U=1), as a function of the CFL number (0 to 5), for MOC and
BFECC+MOC.]
MOC+BFECC
- Assume we have a low-order (dissipative) advection operator (it may be SUPG,
  MOC, or any other): \phi^{t+\Delta t} = L(\phi^t, u).
- The Back and Forth Error Compensation and Correction (BFECC) scheme allows
  us to eliminate the dissipation error:
  . Advance the state forward:   \phi^{t+\Delta t,*} = L(\phi^t, u).
  . Advance the state backwards: \phi^{t,*} = L(\phi^{t+\Delta t,*}, -u).
[Figure: exact profile vs. profile with dissipation error.]
MOC+BFECC (cont.)
[Figure: exact profile vs. profile with dissipation error.]
- If L introduces some dissipative error \epsilon, then \phi^{t,*} \ne \phi^t;
  in fact \phi^{t,*} = \phi^t + 2\epsilon.
- So we can compensate for the error:
  \phi^{t+\Delta t} = L(\phi^t, u) - \epsilon
                    = \phi^{t+\Delta t,*} - 1/2 (\phi^{t,*} - \phi^t).   (2)
  (A minimal sketch of this step is given below.)
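A minimal sketch of the compensation step (2), wrapped around a dissipative
operator L; here the moc_step of the earlier sketch is reused as L, and the
function signature is an assumption for illustration.

    // BFECC around a dissipative advection operator L (here the moc_step
    // sketched earlier, declared below):
    // phi^{n+1} = phi^{n+1,*} - (phi^{n,*} - phi^n)/2.
    #include <cstddef>
    #include <vector>

    std::vector<double> moc_step(const std::vector<double>& phi,
                                 double a, double dt, double h);

    std::vector<double> bfecc_step(const std::vector<double>& phi,
                                   double a, double dt, double h) {
      std::vector<double> fwd = moc_step(phi, a, dt, h);   // forward:  phi^{n+1,*} = L(phi^n, u)
      std::vector<double> bck = moc_step(fwd, -a, dt, h);  // backward: phi^{n,*} = L(phi^{n+1,*}, -u)
      std::vector<double> out(phi.size());
      for (std::size_t i = 0; i < phi.size(); ++i)         // compensate the error eps = (phi^{n,*} - phi^n)/2
        out[i] = fwd[i] - 0.5 * (bck[i] - phi[i]);
      return out;
    }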
MOC+BFECC (cont.)
(launch video video-moc-bfecc-cfl05).
[Figure: max(u) at t=10 vs. CFL (0 to 5) for MOC and BFECC+MOC; periodic
domain, L=1, U=1, initial profile u(t=0).]
MOC+BFECC (cont.)
Nbr of cells      QUICK-SP   BFECC-SP   QUICK-DP   BFECC-DP
64 x 64 x 64        29.09      12.38      15.9       5.23
128 x 128 x 128     75.74      18.00      28.6       7.29
192 x 192 x 192     78.32      17.81      30.3       7.52
Cubic cavity. Computing rates for the whole NS solver (one step), in [Mcell/sec],
obtained with the BFECC and QUICK algorithms on an NVIDIA GTX 580. 3 Poisson
iterations were used.
[jump to Conclusions] 
Analysis of performance
- Regarding the performance results shown above, the computing rate of QUICK
  is at most 4x that of BFECC. So BFECC is more efficient than QUICK whenever
  it is used with CFL >= 2, the critical CFL for QUICK being 0.5. The CFL used
  in our simulations is typically CFL = 5 and thus, at this CFL, the BFECC
  version runs 2.5 times faster than the QUICK version.
- The speedup of MOC+BFECC versus QUICK increases with the number of Poisson
  iterations. In the limit of a very large number of iterations (a very low
  tolerance for the Poisson solver) we expect a speedup of 10x (equal to the
  CFL ratio).
Validation. Lid driven 3D cubic cavity
- Re = 1000, mesh of 128x128x128 (2 Mcell). Results compared with Ku et al.
  (JCP 70(42):439-462 (1987)).
- More validation and a complete performance study in Costarelli et al.,
  Cluster Computing (2013), DOI:10.1007/s10586-013-0329-9.
[Figure: centerline velocity profiles w(y) and v(z) for the lid-driven cubic
cavity; NSFDM results vs. Ku et al.]
Renormalization
Even with a high-precision, low-dissipation algorithm for transporting the
level set function \phi, we have to renormalize \phi -> \phi_0 with a certain
frequency.
- Requirements on the renormalization algorithm are:
  . \phi_0 must preserve as much as possible the 0-level set (the interface)
    of \phi.
  . \phi_0 must be as regular as possible near the interface.
  . \phi_0 must have a high slope near the interface.
  . Usually the signed distance function is used, i.e.
    \phi_0(x) = sign(\phi(x)) \min_{y \in \Gamma} ||y - x||.
[Figure: \phi and the renormalized \phi_0 across the interface \Gamma.]
Renormalization (cont.)
- Computing the distance function plainly is O(N_\Gamma N), where N_\Gamma is
  the number of points on the interface. This scales typically as
  N^{1+(n_d-1)/n_d} (N^{5/3} in 3D).
- Many variants are based on solving the Eikonal equation
  |\nabla \phi| = 1.
- As it is a hyperbolic equation, it can be solved by a marching technique.
  The algorithm traverses the domain with an advancing front starting from
  the level set.
- However, it can develop caustics (shocks) and rarefaction waves, so an
  entropy condition must be enforced.
[Figure: caustic and expansion fan in the distance function.]
Renormalization (cont.)
- The Fast Marching algorithm proposed by Sethian (Proc Nat Acad Sci
  93(4):1591-1595 (1996)) is a fast (near optimal) algorithm based on
  Dijkstra's algorithm for computing minimum distances in graphs from a source
  set. (Note: the original Dijkstra algorithm is O(N^2), not fast. The fast
  version using a priority queue is due to Fredman and Tarjan (ACM Journal
  34(3):596-615, 1987), and its complexity is O(N log(|Q|)) <= O(N log(N)).)
[Figure: level set with the advancing front Q and the far-away nodes F.]
The Fast Marching algorithm
- We explain it for the positive part \phi >= 0; the algorithm is then
  reversed for \phi <= 0.
- All nodes are in either Q = advancing front, F = far-away, or
  I = frozen/inactive. The advancing front sweeps the domain starting at the
  level set and converts F nodes to I.
- Initially Q = {nodes that are in contact with the level set}. Their distance
  to the interface is computed for each cut cell. The rest are in F = far-away.
- loop: Take the node X in Q closest to the interface. Move it from Q to I.
- Update the distances of all the neighbors of X and move them from F to Q.
- Go to loop.
- The algorithm ends when Q is empty.
[Figure: level set with the advancing front Q, far-away nodes F and
frozen/inactive nodes I; X is the front node currently being frozen.]
FastMarch: error and regularity of the distance function
- A numerical example shows the regularity of the computed distance function
  on a 100x100 mesh.
- We have a level set consisting of a circle of R = 0.2 inside a square of
  L = 1.
- \phi is shown along the x = 0.6 cut of the geometry; we also show its first
  and second derivatives.
- \phi deviates less than 10^{-3} from the analytical distance.
- Small spikes are observed in the second derivative.
- The error \phi - \phi_ex shows the discontinuity of the slope at the level
  set.
[Figure: \phi, d\phi/dx, d^2\phi/dx^2 and |\phi - \phi_ex| along the x = 0.6
cut, for a circle of R = 0.2 inside a square of L = 1.]
FastMarch: implementation details
- The complexity is O(N) times the cost of finding the node in Q closest to
  the level set.
- This can be implemented very efficiently with a priority queue built on top
  of a heap. In this way finding the closest node is O(log |Q|), so the total
  cost is
  O(N log |Q|) ~ O(N log(N^{(n_d-1)/n_d})) = O(N log(N^{2/3})) (in 3D).
- The standard C++ class priority_queue is not appropriate because it doesn't
  give access to the elements inside the queue (their keys cannot be updated).
- We implemented the heap structure on top of a vector and an unordered_map
  (hash-table based) that tracks the Q-nodes in the structure. The hash
  function used is very simple. (A minimal sketch of such a structure is
  shown below.)
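A minimal sketch of such a structure: a binary min-heap stored in a std::vector
plus an unordered_map from node id to heap position, so the tentative distance
of a front node can be decreased in O(log |Q|). The node ids, the default hash
and the small demo in main() are illustrative assumptions.

    // Indexed min-heap for the advancing front Q: supports "decrease key",
    // which std::priority_queue cannot do since its elements are not addressable.
    #include <cstdio>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct FrontHeap {
      std::vector<std::pair<double, long>> h;      // (tentative distance, node id)
      std::unordered_map<long, std::size_t> pos;   // node id -> index in h

      void swap_nodes(std::size_t a, std::size_t b) {
        std::swap(h[a], h[b]);
        pos[h[a].second] = a; pos[h[b].second] = b;
      }
      void up(std::size_t i) {
        while (i > 0 && h[(i - 1) / 2].first > h[i].first) { swap_nodes(i, (i - 1) / 2); i = (i - 1) / 2; }
      }
      void down(std::size_t i) {
        for (;;) {
          std::size_t l = 2 * i + 1, r = 2 * i + 2, m = i;
          if (l < h.size() && h[l].first < h[m].first) m = l;
          if (r < h.size() && h[r].first < h[m].first) m = r;
          if (m == i) break;
          swap_nodes(i, m); i = m;
        }
      }
      // Insert a node into Q, or decrease its tentative distance.
      void push_or_update(long id, double d) {
        auto it = pos.find(id);
        if (it == pos.end()) { h.push_back({d, id}); pos[id] = h.size() - 1; up(h.size() - 1); }
        else if (d < h[it->second].first) { h[it->second].first = d; up(it->second); }
      }
      bool empty() const { return h.empty(); }
      // Extract the front node closest to the interface (it is moved Q -> I).
      std::pair<double, long> pop_min() {
        std::pair<double, long> top = h[0];
        swap_nodes(0, h.size() - 1);
        h.pop_back(); pos.erase(top.second);
        if (!h.empty()) down(0);
        return top;
      }
    };

    int main() {
      FrontHeap q;
      q.push_or_update(7, 0.31); q.push_or_update(3, 0.12); q.push_or_update(7, 0.05);
      while (!q.empty()) { auto [d, id] = q.pop_min(); std::printf("node %ld, d = %g\n", id, d); }
      return 0;
    }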
FastMarch renorm: Efficiency
- The Fast Marching algorithm is O(N log |Q|), where N is the number of cells
  and |Q| the size of the advancing front.
- Rates were evaluated on an Intel i7-950@3.07GHz (Nehalem).
- The computing rate is practically constant and even decreases for high N.
- Since the rate of the NS-FVM algorithm is 100 [Mcell/s], renormalizing more
  often than about once every 200 steps would be too expensive.
- The cost of the renormalization step is reduced with band renormalization
  and parallelism (SMP).
[Figure: Fast Marching processing rate [Mcells/s] vs. N (10^3 to 10^7), for
meshes from 32x32 up to 2048x2048.]
FastMarch renorm: band renormalization
- The renormalization algorithm doesn't need to cover the whole domain; only
  a band around the level set (interface) is needed.
- The algorithm is modified simply by setting the distance at far-away nodes
  to d = d_max.
- The cost is proportional to the volume of the band, i.e.
  V_band = S_band * 2 d_max \propto d_max.
- A low d_max reduces the cost, but increases the probability of forcing a new
  renormalization, thus increasing the renormalization frequency.
[Figure: \phi renormalized with and without d_max; the width of the
renormalization band is 2 d_max.]
FastMarch renorm: Parallelization
How can we parallelize FastMarch? We can use speculative parallelism: while
processing a node X at the top of the heap, we can process in parallel the
following node Y, speculating that most of the time node Y will be far from X
and can therefore be processed independently. This can be checked afterwards,
using time-stamps for instance.
[Figure: advancing front around the level set, with nodes numbered in
processing order; all nodes of the same color can be processed at the same
time.]
FastMarch renorm: Parallelization (cont.)
- How many nodes can be processed concurrently? It turns out that the
  simultaneity (the number of nodes that can be processed simultaneously)
  grows linearly with refinement.
- Average simultaneity:
  16x16: 11.358
  32x32: 20.507
- Percentage of times the simultaneity is >= 4:
  16x16: 93.0%
  32x32: 98.0%
[Figure: histograms of the number of independent points vs. the fraction of
the advancing front, for the 16x16 and 32x32 meshes.]
FastMarching: computational budget
- With band renormalization and SMP parallelization we expect a rate of
  ~20 Mcell/s.
- That means that a 128^3 mesh (2 Mcell) can be renormalized in 100 ms.
- This is 7x the time required for one time step (14 ms).
- Renormalization is therefore amortized if the renormalization frequency is
  about 1/20 time steps or lower.
- Transfer of the data to and from the processor through the PCI Express 2.0
  x16 channel (4 GB/s transfer rate) is in the order of 10 ms.
- BTW: note that transfers to/from the card are amortized only if they are
  performed every ~10 steps or so; such transfers can't be done every time
  step.
Fluid structure interaction
- In the previous examples we have shown several cases of rigid solid bodies
  moving immersed in a fluid. In those cases the position of the body is
  prescribed beforehand.
- A more complex case is when the position of the body results from the
  interaction with the fluid.
- In general, if the body is not rigid, we have to solve the elasticity
  equations in the body domain \Omega_bdy(t). We assume here that the body is
  rigid, so we just solve for the 6 d.o.f. of the rigid body from its linear
  and angular momentum balance.
Fluid structure in the real world
Three Gorges (PR China) dam
Francis turbine
Belo Monte Francis turbine installation
Sayano-Shushenskaya (Russia) dam: machine room prior to the 2009 accident
Sayano-Shushenskaya (Russia) dam: machine room after the 2009 accident
- IMPSA (Industria Metalúrgica Pescarmona) is manufacturing three Francis
  turbines for the Belo Monte (Brazil) dam. Once in operation it will be the
  3rd largest dam in the world.
- Ignition of a rocket engine (launch video video-simulacion-motor-cohete).
Fluid structure in the real world (cont.)
- CIMEC computed the rotordynamic coefficients, that is, a reduced-parameter
  representation of the fluid in the gap between stator and rotor. (launch
  video movecyl-tadpoles)
- The rotordynamic coefficients represent the forces of the fluid onto the
  rotor as the rotor orbits around its centered position. They are used as a
  component of the design of the structural part of the turbine.
- They represent the added mass, damping, and stiffness (a.k.a. Lomakin
  effect).
- The geometry is very complex: a circular corona of 8 m diameter, 3 mm gap,
  and 2 m height.
A buoyant fully submerged sphere
- The fluid is water, \rho_fluid = 1000 [kg/m^3].
- The body is a smooth sphere of D = 10 cm, fully submerged in a box of
  L = 39 cm. The solid density is \rho_bdy = 366 [kg/m^3] ~ 0.4 \rho_fluid,
  so the body has positive buoyancy. The buoy (sphere) is tied by a string to
  an anchor point at the center of the bottom side of the box.
- The box is subject to horizontal harmonic oscillations:
  x_box = A_box sin(\omega t).
Spring-mass-damper oscillator
- The acceleration of the box creates a driving force on the sphere. Note
  that, as the sphere is positively buoyant, the driving force is in phase
  with the displacement: if the box accelerates to the right, the sphere tends
  to move in the same direction. (launch video p2-f09)
- The buoyancy of the sphere produces a restoring force that tends to keep
  the string aligned with the vertical direction.
- The mass of the sphere plus the added mass of the liquid gives inertia.
- The drag of the fluid gives a damping effect.
- So the system can be approximated as a (nonlinear) spring-mass-damper
  oscillator (a time-integration sketch is given below):

  m_tot \ddot{x}                      [inertia]
  + 1/2 \rho C_d A |\dot{x}| \dot{x}  [damping]
  + (m_flot g / L) x                  [spring]
  = m_flot a_box                      [driving force],

  m_tot = m_bdy + m_add,   m_flot = m_d - m_bdy,
  m_add = C_madd m_d,   C_madd ~ 0.5,   m_d = \rho V_bdy.
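A minimal time-integration sketch of this oscillator, using semi-implicit
Euler and the parameter values quoted elsewhere in the talk (m_bdy = 0.192 kg,
m_d = 0.523 kg, m_add = 0.288 kg, C_d = 0.3, A_box = 2 cm); the string length
L and the forcing frequency used here are illustrative assumptions.

    // Forced nonlinear spring-mass-damper model of the tethered buoyant sphere.
    #include <cmath>
    #include <cstdio>

    int main() {
      const double PI = std::acos(-1.0);
      const double rho = 1000.0, D = 0.10, g = 9.81;
      const double m_bdy = 0.192, m_d = 0.523, m_add = 0.288;
      const double m_tot = m_bdy + m_add, m_flot = m_d - m_bdy;
      const double Cd = 0.3, A = PI * D * D / 4.0;          // frontal area of the sphere
      const double L = 0.145;                                // string length (assumed)
      const double Abox = 0.02, f = 1.0, w = 2.0 * PI * f;   // forcing (f assumed in the 0.5-1.5 Hz range)

      double x = 0.0, v = 0.0;
      const double dt = 1.0e-3;
      for (double t = 0.0; t < 20.0; t += dt) {
        double a_box = -Abox * w * w * std::sin(w * t);      // box acceleration, from x_box = Abox sin(w t)
        double F = m_flot * a_box                            // driving force
                 - 0.5 * rho * Cd * A * std::fabs(v) * v     // quadratic drag (damping)
                 - (m_flot * g / L) * x;                     // buoyancy restoring force (spring)
        v += dt * F / m_tot;                                 // semi-implicit Euler
        x += dt * v;
      }
      std::printf("x(20 s) = %g m\n", x);
      return 0;
    }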
Experimental results
- Experiments were carried out at Univ. Santiago de Chile (Dra. Marcela
  Cruchaga).
- A Plexiglas box mounted on a shake table is subject to oscillations of 2 cm
  amplitude at frequencies of 0.5-1.5 Hz.
- The sphere position is obtained using Motion Capture (MoCap) from a high
  speed (HS) camera (800x800 pixels @ 120 fps). The HS cam is an AOS Q-PRI;
  it can do 3 Mpixel @ 500 fps, and its maximum frame rate is 100,000 fps.
- MoCap is performed with a home-made code. It finds the current position by
  minimization of a functional, and can also detect rotations. In this
  experimental setup 1 pixel = 0.5 mm (D = 200 px). Camera distortion and
  refraction have been evaluated to be negligible. The MoCap resolution can
  be less than one pixel. (launch video mocap2)
Experimental results (cont.)
[Figure: phase [deg] and amplitude [m] of the sphere motion vs. frequency
[Hz]; experimental results compared with the analytical oscillator for
C_D = 0.3, C_madd = 0.55.]
The best fit of the experimental results with the simple spring-mass-damper
oscillator is obtained for C_madd = 0.55 and C_D = 0.3. Reference values are:
- C_madd = 0.5 (potential flow), C_madd = 0.58 (Neill et al., Amer J Phys
  2013),
- C_D = 0.47 (reference value). (launch video description-buoy-resoviz)
Computational FSI model
- The fluid structure interaction is solved with a partitioned Fluid Structure
  Interaction (FSI) technique.
- The fluid is solved in the step [t^n, t^{n+1} = t^n + \Delta t] between the
  known position x^n of the body at t^n and an extrapolated (predicted)
  position x^{n+1,P} at t^{n+1}.
- Then the fluid forces (pressure and viscous tractions) are computed and the
  rigid body equations are solved to compute the true position of the solid,
  x^{n+1}.
- The scheme is second-order accurate. It can be put as a fixed point
  iteration and iterated to convergence (enhancing stability).
- The rigid body eqs. are now
  m_bdy \ddot{x} + (m_flot g / L) x = F_fluid + m_flot a_box,
  where F_fluid = F_drag + F_add is now computed with the CFD-NS-FVM code.
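One possible form of the predictor and of the fixed point iteration, written
only as a sketch (the second-order extrapolation below is an assumption; the
slide does not give the exact predictor used):

    x^{n+1,P} = 2 x^n - x^{n-1}                                   (predicted position)
    x^{n+1}_{(k+1)} = RigidBodyStep( F_fluid( x^{n+1}_{(k)} ) ),  k = 0, 1, ...  (fixed point iteration)

Iterating over k until convergence gives the enhanced-stability version
mentioned above; a single pass gives the plain staggered scheme.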
Why use a positively buoyant body?
- Governing equations for the body:
  m_bdy \ddot{x} + (m_flot g / L) x = F_fluid({x}) + m_flot a_box,
  where F_fluid = F_drag + F_add are the fluid forces computed with the
  CFD-NS-FVM code.
- The whole methodology (computing the response curves, amplitude vs.
  frequency, and comparing with experiments) is a means to validate the
  precision of the numerical model.
- The most important and hardest part, from the point of view of the numerical
  method, is to compute the fluid forces (drag and added mass). Buoyancy is
  almost trivial and can easily be well captured or factored out of the
  simulation.
- Then, if the body is too heavy it is barely affected by the fluid forces, so
  we prefer a body as light as possible.
- In the present example m_bdy = 0.192 [kg], m_d = 0.523 [kg],
  m_add = 0.288 [kg].
Dynamic Absorbing Boundary ConditionsStorti Mario
 
Actividades de Transferencia en CIMEC
Actividades de Transferencia en CIMECActividades de Transferencia en CIMEC
Actividades de Transferencia en CIMECStorti Mario
 
Algoritmos y Estructuras de Datos
Algoritmos y Estructuras de DatosAlgoritmos y Estructuras de Datos
Algoritmos y Estructuras de DatosStorti Mario
 

Plus de Storti Mario (11)

Avances en la modelización computacional de problemas de interacción Fluido-E...
Avances en la modelización computacional de problemas de interacción Fluido-E...Avances en la modelización computacional de problemas de interacción Fluido-E...
Avances en la modelización computacional de problemas de interacción Fluido-E...
 
Programacion en C++ para Ciencia e Ingeniería
Programacion en C++ para Ciencia e IngenieríaProgramacion en C++ para Ciencia e Ingeniería
Programacion en C++ para Ciencia e Ingeniería
 
Air pulse fruit harvester
Air pulse fruit harvesterAir pulse fruit harvester
Air pulse fruit harvester
 
Automatic high order absorption layers for advective-diffusive systems of equ...
Automatic high order absorption layers for advective-diffusive systems of equ...Automatic high order absorption layers for advective-diffusive systems of equ...
Automatic high order absorption layers for advective-diffusive systems of equ...
 
El Método Científico
El Método CientíficoEl Método Científico
El Método Científico
 
Supercomputadoras en carrera
Supercomputadoras en carreraSupercomputadoras en carrera
Supercomputadoras en carrera
 
Cosechadora Innovar
Cosechadora InnovarCosechadora Innovar
Cosechadora Innovar
 
HPC course on MPI, PETSC, and OpenMP
HPC course on MPI, PETSC, and OpenMPHPC course on MPI, PETSC, and OpenMP
HPC course on MPI, PETSC, and OpenMP
 
Dynamic Absorbing Boundary Conditions
Dynamic Absorbing Boundary ConditionsDynamic Absorbing Boundary Conditions
Dynamic Absorbing Boundary Conditions
 
Actividades de Transferencia en CIMEC
Actividades de Transferencia en CIMECActividades de Transferencia en CIMEC
Actividades de Transferencia en CIMEC
 
Algoritmos y Estructuras de Datos
Algoritmos y Estructuras de DatosAlgoritmos y Estructuras de Datos
Algoritmos y Estructuras de Datos
 

Dernier

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptMsecMca
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...tanu pandey
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdfSuman Jyoti
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 

Dernier (20)

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 

Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling Fluid Structure Interaction for a Submerged Spherical Buoy

• 5. Xeon Phi (cont.)
- Xeon Phi is an "alien computer": it fits in a PCI Express x16 slot and runs its own basic Linux system. You can SSH to the card and run x86-64 code. Another workflow is to run the code on the host and offload intensive computing tasks to the card (e.g. solving a linear system).
- In January 2013 the Texas Advanced Computing Center (TACC) added Xeon Phis to its Stampede supercomputer. Main CPUs are Xeon E5-2680; 128 nodes have NVIDIA Kepler K20 GPUs. Estimated performance: 7.6 Pflops.
- Tianhe-2 (China), the current fastest supercomputer (33.86 Pflops), also includes Xeon Phi coprocessors.
- Xeon Phi is part of Intel's Many Integrated Core (MIC) architecture. Previous codenames for the project: Larrabee, Knights Ferry, Knights Corner.
• 6. We want to use GPUs... for what?
- Aerodynamics of a racing car
- Aerodynamics of an artillery projectile
- Flow pattern in a harbour
- Dust particles in a coke processing facility
- Thermo-fluid coupling
- Aerodynamics of a falling object
- Flutter at Ma=2.3 (launch video showall)
• 7. GPUs in HPC
- Some HPC people are skeptical about the effective computing power of GPUs for scientific applications:
  - in many works the speedup is referred to the available CPU processors, which is not a consistent comparison;
  - the delivered speedup w.r.t. mainstream x86 processors is often much lower than expected;
  - strict data parallelism is difficult to achieve in CFD applications.
- Unfortunately, this idea is reinforced by the fact that GPUs come from the videogame special-effects industry, not from scientific computing.
• 8. Solution of incompressible Navier-Stokes flows on GPU
- GPUs are less efficient for algorithms that require access to the card's (device) global memory. Shared memory is much faster but usually scarce (16 KB per thread block on the Tesla C1060).
- The best algorithms are those where the computation for one cell requires only information on that cell and its neighbors. These algorithms are classified as cellular automata (CA). Lattice-Boltzmann and explicit FxM (FDM/FVM/FEM) schemes fall in this category.
- Structured meshes require less data exchange between cells (e.g. neighbor indices are computed, not stored), and so they require less shared memory. Also, very fast solvers like FFT-based (Fast Fourier Transform) or Geometric Multigrid solvers are available.
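The remark that neighbor indices on a structured grid are computed rather than stored can be made concrete with a tiny sketch. The row-major linearization below is a common choice and an assumption of this example only, not necessarily the layout used in the actual solver.

```cpp
// Minimal sketch: on a structured nx x ny x nz grid the linear index of a cell and of
// its neighbors can be computed on the fly, so no connectivity table has to be stored.
#include <cstdio>

inline int idx(int i, int j, int k, int nx, int ny) {
  return i + nx * (j + ny * k);  // row-major linear index (assumed layout)
}

int main() {
  const int nx = 128, ny = 128;
  int i = 10, j = 20, k = 30;
  int center = idx(i, j, k, nx, ny);
  int east   = idx(i + 1, j, k, nx, ny);   // +x neighbor
  int north  = idx(i, j + 1, k, nx, ny);   // +y neighbor
  int top    = idx(i, j, k + 1, nx, ny);   // +z neighbor
  std::printf("center=%d east=%d north=%d top=%d\n", center, east, north, top);
  return 0;
}
```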
• 9. Fractional Step Method on structured grids with QUICK
- Proposed by Molemaker et al., SCA'08 (2008 ACM SIGGRAPH): "Low viscosity flow simulations for animation".
- Fractional Step Method (a.k.a. pressure segregation).
- u, v, w and continuity cells are staggered (MAC = Marker And Cell).
- The QUICK advection scheme is used in the predictor stage.
- The Poisson system is solved with IOP (Iterated Orthogonal Projection, described later), on top of Geometric Multigrid.
[Figure: staggered MAC grid showing x- and y-momentum cells and nodes, continuity cells and nodes, and the locations of Uij, Vij, Pij.]
• 10. Solution of the Poisson equation with FFT
- The solution of the Poisson equation is, for large meshes, the most CPU-consuming stage in Fractional-Step-like Navier-Stokes solvers. We have to solve a linear system $Ax = b$.
- The Discrete Fourier Transform (DFT) is an orthogonal transformation $\hat{x} = Ox = \mathrm{fft}(x)$. The inverse transformation $O^{-1} = O^T$ is the inverse Fourier transform $x = O^T\hat{x} = \mathrm{ifft}(\hat{x})$.
- If the operator matrix $A$ is spatially invariant (i.e. the stencil is the same at all grid points) and the boundary conditions are periodic, then it can be shown that $O$ diagonalizes $A$, i.e. $OAO^{-1} = D$. So in the transformed basis the system of equations is diagonal:
  $(OAO^{-1})(Ox) = (Ob), \qquad D\hat{x} = \hat{b}.$   (1)
- For $N = 2^p$ the Fast Fourier Transform (FFT) is an algorithm that computes the DFT (and its inverse) in $O(N\log N)$ operations.
• 11. Solution of the Poisson equation with FFT (cont.)
So the following algorithm computes the solution of the system in $O(N\log N)$ operations:
- $\hat{b} = \mathrm{fft}(b)$ (transform the r.h.s.),
- $\hat{x} = D^{-1}\hat{b}$ (solve the diagonal system, $O(N)$),
- $x = \mathrm{ifft}(\hat{x})$ (anti-transform to get the solution vector).
Total cost: 2 FFTs plus one element-by-element vector multiply (the reciprocals of the diagonal values of $D$ are precomputed).
In order to precompute the diagonal values of $D$:
- take any vector $z$ (with all Fourier components nonzero) and compute $y = Az$,
- then transform $\hat{z} = \mathrm{fft}(z)$, $\hat{y} = \mathrm{fft}(y)$,
- $D_{jj} = \hat{y}_j / \hat{z}_j$.
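A minimal sketch of this diagonalization trick, assuming a 1-D periodic Poisson problem and a naive O(N^2) DFT so that the example stays self-contained; the solver described in these slides would use CUFFT (or FFTW on the CPU) instead.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cplx = std::complex<double>;
static const double PI = std::acos(-1.0);

// Forward (sign=-1) or inverse (sign=+1) DFT; the inverse includes the 1/N factor.
std::vector<cplx> dft(const std::vector<cplx>& x, int sign) {
  int N = (int)x.size();
  std::vector<cplx> X(N, cplx(0.0));
  for (int k = 0; k < N; ++k)
    for (int j = 0; j < N; ++j)
      X[k] += x[j] * std::polar(1.0, sign * 2.0 * PI * k * j / N);
  if (sign > 0) for (auto& v : X) v /= (double)N;
  return X;
}

// Periodic 1-D Laplacian stencil: (A u)_i = (u_{i-1} - 2 u_i + u_{i+1}) / h^2.
std::vector<cplx> applyA(const std::vector<cplx>& u, double h) {
  int N = (int)u.size();
  std::vector<cplx> r(N);
  for (int i = 0; i < N; ++i)
    r[i] = (u[(i + N - 1) % N] - 2.0 * u[i] + u[(i + 1) % N]) / (h * h);
  return r;
}

int main() {
  const int N = 64; const double h = 1.0 / N;

  // Precompute D: take z = delta (all its Fourier components equal 1), compute y = A z,
  // and then D_kk = y^_k / z^_k = y^_k.
  std::vector<cplx> z(N, cplx(0.0)); z[0] = 1.0;
  std::vector<cplx> D = dft(applyA(z, h), -1);

  // Right-hand side with zero mean (the periodic Laplacian is singular in mode 0).
  std::vector<cplx> b(N);
  for (int i = 0; i < N; ++i) b[i] = std::sin(2.0 * PI * i * h);

  // Solve: transform b, divide by D (skipping the zero mode), anti-transform.
  std::vector<cplx> bh = dft(b, -1), xh(N);
  for (int k = 0; k < N; ++k) xh[k] = (k == 0) ? cplx(0.0) : bh[k] / D[k];
  std::vector<cplx> x = dft(xh, +1);

  // Residual ||A x - b||_inf should be at machine-precision level.
  std::vector<cplx> Ax = applyA(x, h);
  double res = 0.0;
  for (int i = 0; i < N; ++i) res = std::max(res, std::abs(Ax[i] - b[i]));
  std::printf("residual = %.3e\n", res);
  return 0;
}
```

The zero mode is skipped because the periodic Laplacian is singular; the residual check only verifies that the diagonal solve reproduces A x = b.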
• 12. Solution of the Poisson equation on embedded geometries
- FFT solvers and GMG are very fast but have several restrictions: translational invariance of the operator and periodic boundary conditions. They are not well suited for embedded geometries.
- One approach is the IOP (Iterated Orthogonal Projection) algorithm. It is based on iteratively solving the Poisson equation on the whole domain (fluid + solid). Solving on the whole domain is fast, because algorithms like Geometric Multigrid or FFT can be used; they are also very efficient on GPUs.
- However, if we solve on the whole domain we cannot enforce the boundary condition $\partial p/\partial n = 0$ at the solid boundary, which means a violation of the impenetrability condition at the solid boundary.
• 13. The IOP (Iterated Orthogonal Projection) method
The method is based on successively solving for the incompressibility condition (on the whole domain: solid + fluid) and imposing the boundary condition.
[Figure: the velocity field solved on the whole domain (fluid + solid) violates the impenetrability b.c.; after the boundary projection it satisfies the impenetrability b.c.]
• 14. The IOP (Iterated Orthogonal Projection) method (cont.)
Fixed-point iteration: $w^{k+1} = \Pi_{bdy}\,\Pi_{div}\,w^k$, where
- $\Pi_{div}$ is the projection onto the space of divergence-free velocity fields, $u' = \Pi_{div}(u)$:
  $u' = u - \nabla P, \qquad \Delta P = \nabla\cdot u;$
- $\Pi_{bdy}$ is the projection onto the space of velocity fields that satisfy the impenetrability boundary condition, $u'' = \Pi_{bdy}(u')$:
  $u'' = u_{bdy}$ in the body, $u'' = u'$ in the fluid.
• 15. Implementation details on the GPU
- We use the CUFFT library. Per iteration: 2 FFTs and one Poisson residual evaluation.
- The FFT on the GPU Tesla C1060 performs at 27 Gflops in double precision, where the operation count is taken as $5N\log_2(N)$.
[Figure: FFT computing rate [Gflops] vs. vector size, from 10^3 to 10^8.]
• 16. FFT computing rates in GPGPU: GTX-580
[Figure: FFT processing rate [Gflops/sec] vs. Ncell, in single and double precision, for meshes from 64x64x64 up to 256x256x128.]
• 17. FFTW on i7-3820@3.60GHz (Sandy Bridge)
[Figure: FFT processing rate [Mflops] vs. Ncell (10^3 to 10^9) for nthreads = 1, 2, 4.]
• 18. NSFVM computing rates in GPGPU: scaling
[Figure: processing rate [Mcell/sec] vs. number of cells [Mcell] for GTX-580 and C2050, in single and double precision, for meshes from 64x64x64 up to 256x256x256.]
• 19. NSFVM and “Real Time” computing
- For a 128x128x128 mesh (≈2 Mcell) we have a computing time of 2 Mcell / (140 Mcell/sec) = 0.014 secs per time step, i.e. 70 steps/sec.
- A von Neumann stability analysis shows that the QUICK scheme is unconditionally unstable if advanced in time with Forward Euler. With a second-order Adams-Bashforth scheme the critical CFL is 0.588. For the NS eqs. the critical CFL has been found to be somewhat lower (≈0.5).
- If L = 1, u = 1, h = 1/128, then Δt = 0.5 h/u = 0.004 [sec], so that in 1 sec of wall-clock time we can compute 0.28 secs of simulation time. We say ST/RT = 0.28.
- On the GTX Titan Black we expect to run 3.5x faster, and it has twice the memory (6 GB versus 3 GB). We expect it would allow running with up to 12 Mcell (currently 6 Mcell on the GTX 580).
(launch video nsfvm-bodies-all)
• 20. NSFVM and “Real Time” computing (cont.)
[Figure/video slide]
• 21. Computing times in GPGPU: Fractional Step components
[Figure: share of the total computing time [%] for the predictor, comp-div, CG-FFT-solver and corrector stages, for 128x128x128 and 256x256x128 meshes, GTX 580, single precision.]
• 22. Current work
Current work is being done in the following directions:
- Improving performance by replacing the QUICK advection scheme with MOC+BFECC (which could be more GPU-friendly).
- Implementing a CPU-based renormalization algorithm for free-surface (level-set) flows.
- Improving the representation (accuracy) of the solid body surface by using an immersed boundary technique.
• 23. Why leave QUICK?
- One of the steps of the Fractional Step Method is the advection step. We have to advect the velocity field and we want a method that is as little diffusive as possible and that allows as large a CFL number as possible. Also, of course, we want a GPU-friendly algorithm.
- Previously we used QUICK, but it has a stencil that extends more than one cell in the upwind direction. This increases shared memory usage and data transfer. We look for another low-dissipation scheme with a more compact stencil.
• 24. QUICK advection scheme
1D scalar advection-diffusion, with $a$ the advection velocity and $\phi$ the advected scalar. For the momentum cell centered at $i+1/2$ (faces at $i$ and $i+1$):
  $\left.\frac{\partial (a\phi)}{\partial x}\right|_{i+1/2} \approx \frac{(aQ)_{i+1} - (aQ)_i}{\Delta x},$
with the QUICK face value
  $Q_i = \begin{cases} \tfrac{3}{8}\phi_{i+1/2} + \tfrac{6}{8}\phi_{i-1/2} - \tfrac{1}{8}\phi_{i-3/2}, & \text{if } a \ge 0,\\[2pt] \tfrac{3}{8}\phi_{i-1/2} + \tfrac{6}{8}\phi_{i+1/2} - \tfrac{1}{8}\phi_{i+3/2}, & \text{if } a < 0.\end{cases}$
[Figure: control volume (cell $i+1/2$) with nodes at $i-3/2$, $i-1/2$, $i+1/2$, $i+3/2$, $i+5/2$ and faces at $i$, $i+1$.]
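A small 1-D sketch of the QUICK face interpolation written out above. Array names and the periodic wrap are illustrative only, and the single explicit update in main() is just to show how the face values enter the flux difference, not a recommendation of Forward Euler time stepping.

```cpp
#include <cstdio>
#include <vector>

// QUICK value at face i. Here phi[j] stores phi_{j+1/2}, the cell value at the
// half-integer node, with periodic indexing.
double quick_face(const std::vector<double>& phi, int i, double a) {
  int N = (int)phi.size();
  auto at = [&](int j) { return phi[((j % N) + N) % N]; };   // phi_{j+1/2}, periodic
  if (a >= 0.0)  // upwind side on the left: 3/8 phi_{i+1/2} + 6/8 phi_{i-1/2} - 1/8 phi_{i-3/2}
    return 0.375 * at(i) + 0.75 * at(i - 1) - 0.125 * at(i - 2);
  else           // upwind side on the right
    return 0.375 * at(i - 1) + 0.75 * at(i) - 0.125 * at(i + 1);
}

int main() {
  const int N = 16; const double a = 1.0, dx = 1.0 / N, dt = 0.4 * dx / a;
  std::vector<double> phi(N, 0.0), phinew(N);
  for (int i = 4; i < 8; ++i) phi[i] = 1.0;        // a square pulse
  for (int i = 0; i < N; ++i) {                    // one explicit flux-difference update
    double Fi  = a * quick_face(phi, i, a);        // flux at face i
    double Fip = a * quick_face(phi, i + 1, a);    // flux at face i+1
    phinew[i] = phi[i] - dt / dx * (Fip - Fi);
  }
  for (double v : phinew) std::printf("%g ", v);
  std::printf("\n");
  return 0;
}
```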
• 29. Method Of Characteristics (MOC)
- The Method Of Characteristics (MOC) consists in tracking the node back along the characteristic to the position it had at time $t^n$ and taking its value there:
  $\phi(x^{n+1}, t^{n+1}) = \phi(x^n, t^n).$
- If $x^n$ does not happen to be a mesh node, this involves an interpolation (projection) from the mesh values.
- It is the basis of the Lagrangian methods for dealing with advection terms.
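A minimal 1-D sketch of such a semi-Lagrangian (MOC) step on a periodic grid, with linear interpolation at the departure point; all names and parameters are illustrative.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// One semi-Lagrangian step: each node looks back a distance a*dt along the
// characteristic and linearly interpolates the old field there.
std::vector<double> moc_step(const std::vector<double>& phi, double a,
                             double dt, double dx) {
  int N = (int)phi.size();
  std::vector<double> out(N);
  for (int i = 0; i < N; ++i) {
    double xd = i - a * dt / dx;             // departure point in index units
    double j0 = std::floor(xd);
    double w  = xd - j0;                     // linear interpolation weight
    int jl = ((int)j0 % N + N) % N;          // periodic wrap of the two
    int jr = (jl + 1) % N;                   // surrounding nodes
    out[i] = (1.0 - w) * phi[jl] + w * phi[jr];
  }
  return out;
}

int main() {
  const int N = 32; const double a = 1.0, dx = 1.0 / N, dt = 2.5 * dx / a;  // CFL 2.5
  std::vector<double> phi(N, 0.0);
  for (int i = 8; i < 16; ++i) phi[i] = 1.0;     // square pulse
  phi = moc_step(phi, a, dt, dx);
  for (double v : phi) std::printf("%g ", v);
  std::printf("\n");
  return 0;
}
```

The interpolation is where the dissipation discussed on the next slides comes from; note that the step remains stable even for CFL > 1.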
• 30. Method Of Characteristics (MOC) (cont.)
- So typically MOC has very low diffusion if the CFL is an integer number (the departure point falls on a node), and is too diffusive if it is a half-integer number.
- Of course, in the general case (non-uniform meshes, non-uniform velocity field) we cannot manage to have an integer CFL number for all the nodes.
(launch video video-moc-cfl1), (launch video video-moc-cfl05)
[Figure: max(u) after advecting a pulse for t=10 (U=1, L=1, periodic) as a function of CFL, for MOC and BFECC+MOC.]
• 31. MOC+BFECC
Assume we have a low-order (dissipative) advection operator $L$ (it may be SUPG, MOC, or any other): $\phi^{t+\Delta t} = L(\phi^t, u)$. The Back and Forth Error Compensation and Correction (BFECC) scheme allows eliminating the dissipation error:
- Advance the state forward: $\phi^{t+\Delta t,*} = L(\phi^t, u)$.
- Advance the state backwards: $\phi^{t,*} = L(\phi^{t+\Delta t,*}, -u)$.
[Figure: exact trajectory vs. trajectory with dissipation error.]
• 32. MOC+BFECC (cont.)
If $L$ introduces some dissipative error $\epsilon$, then $\phi^{t,*} \ne \phi^t$; in fact $\phi^{t,*} = \phi^t + 2\epsilon$ (the error is accumulated once in each direction). So we can compensate for the error:
  $\phi^{t+\Delta t} = L(\tilde\phi^t, u), \qquad \tilde\phi^t = \phi^t - \tfrac{1}{2}\left(\phi^{t,*} - \phi^t\right).$   (2)
[Figure: exact trajectory vs. trajectory with dissipation error.]
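A minimal sketch of the BFECC wrapper of eq. (2) around a dissipative operator L. For brevity the low-order operator here is first-order upwind rather than the MOC step used in the actual solver; everything else is illustrative.

```cpp
#include <cstdio>
#include <vector>

// One step of periodic first-order upwind advection (the low-order operator L).
std::vector<double> L_step(const std::vector<double>& phi, double a,
                           double dt, double dx) {
  int N = (int)phi.size();
  std::vector<double> out(N);
  double c = a * dt / dx;                         // signed CFL number
  for (int i = 0; i < N; ++i) {
    int im = (i + N - 1) % N, ip = (i + 1) % N;
    out[i] = (a >= 0.0) ? phi[i] - c * (phi[i] - phi[im])
                        : phi[i] - c * (phi[ip] - phi[i]);
  }
  return out;
}

// BFECC: forward, backward, estimate the error, compensate, and advance again.
std::vector<double> bfecc_step(const std::vector<double>& phi, double a,
                               double dt, double dx) {
  int N = (int)phi.size();
  std::vector<double> fwd  = L_step(phi, a, dt, dx);   // phi^{t+dt,*} = L(phi^t,  u)
  std::vector<double> back = L_step(fwd, -a, dt, dx);   // phi^{t,*}    = L(.,     -u)
  std::vector<double> tilde(N);
  for (int i = 0; i < N; ++i)
    tilde[i] = phi[i] - 0.5 * (back[i] - phi[i]);       // compensated initial state
  return L_step(tilde, a, dt, dx);                      // phi^{t+dt} = L(tilde, u)
}

int main() {
  const int N = 64; const double a = 1.0, dx = 1.0 / N, dt = 0.5 * dx / a;
  std::vector<double> phi(N, 0.0);
  for (int i = 16; i < 32; ++i) phi[i] = 1.0;
  for (int step = 0; step < 64; ++step) phi = bfecc_step(phi, a, dt, dx);
  double mx = 0.0; for (double v : phi) mx = (v > mx) ? v : mx;
  // The pulse stays much sharper than with plain upwind advection.
  std::printf("max(phi) after 64 BFECC steps = %g\n", mx);
  return 0;
}
```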
• 33. MOC+BFECC (cont.)
(launch video video-moc-bfecc-cfl05)
[Figure: max(u) after t=10 vs. CFL (U=1, L=1, periodic) for MOC and BFECC+MOC.]
• 34. MOC+BFECC (cont.)
Cubic cavity. Computing rates for the whole NS solver (one step), in [Mcell/sec], obtained with the BFECC and QUICK algorithms on an NVIDIA GTX 580; 3 Poisson iterations were used.

  Mesh            | QUICK-SP | BFECC-SP | QUICK-DP | BFECC-DP
  64 x 64 x 64    |   29.09  |   12.38  |   15.9   |   5.23
  128 x 128 x 128 |   75.74  |   18.00  |   28.6   |   7.29
  192 x 192 x 192 |   78.32  |   17.81  |   30.3   |   7.52

[jump to Conclusions]
• 35. Analysis of performance
- Regarding the performance results shown above, it can be seen that the computing rate of QUICK is at most 4x faster than that of BFECC. So BFECC is more efficient than QUICK whenever it is used with CFL ≥ 2, the critical CFL for QUICK being 0.5.
- The CFL used in our simulations is typically CFL ≈ 5, and thus at this CFL the BFECC version runs 2.5 times faster than the QUICK version.
- The speedup of MOC+BFECC versus QUICK increases with the number of Poisson iterations. In the limit of a very large number of iterations (a very low tolerance for the Poisson stage) we expect a speedup of 10x (equal to the CFL ratio).
• 36. Validation: lid-driven 3D cubic cavity
- Re = 1000, mesh of 128x128x128 (2 Mcell). Results compared with Ku et al. (JCP 70(42):439-462, 1987).
- More validation and a complete performance study in Costarelli et al., Cluster Computing (2013), DOI:10.1007/s10586-013-0329-9.
[Figure: centerline velocity profiles w(y) and v(z), present solver vs. Ku et al.]
• 37. Renormalization
Even with a high-precision, low-dissipation algorithm for transporting the level-set function, we have to renormalize it ($\phi \to \phi_0$) with a certain frequency. Requirements on the renormalization algorithm are:
- $\phi_0$ must preserve as much as possible the 0 level set of $\phi$ (the interface $\Gamma$).
- $\phi_0$ must be as regular as possible near the interface.
- $\phi_0$ must have a high slope near the interface.
- Usually the signed distance function is used, i.e.
  $\phi_0(x) = \mathrm{sign}(\phi(x))\,\min_{y\in\Gamma} \|y - x\|.$
[Figure: a level-set function $\phi$ and its renormalized version $\phi_0$ sharing the same interface $\Gamma$.]
• 38. Renormalization (cont.)
- Computing the distance function by brute force is $O(N\,N_\Gamma)$, where $N_\Gamma$ is the number of points on the interface. This typically scales as $\propto N^{1+(n_d-1)/n_d}$ ($N^{5/3}$ in 3D).
- Many variants are based on solving the Eikonal equation $|\nabla\phi| = 1$. As it is a hyperbolic equation it can be solved by a marching technique: the algorithm traverses the domain with an advancing front starting from the level set. However, it can develop caustics (shocks) and rarefaction (expansion) waves, so an entropy condition must be enforced.
[Figure: caustic and expansion fan developed by the distance fronts.]
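The brute-force renormalization mentioned above can be sketched as follows; the circular interface, its sampling and the grid size are illustrative assumptions. Every grid node scans all interface points, which is what makes the cost O(N * N_Gamma).

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const double PI = std::acos(-1.0);
  const int n = 100;                              // n x n grid on the unit square
  const double h = 1.0 / (n - 1), R = 0.2, xc = 0.5, yc = 0.5;

  // Sample the interface (the zero level set) with N_Gamma points.
  const int NG = 400;
  std::vector<double> gx(NG), gy(NG);
  for (int k = 0; k < NG; ++k) {
    double th = 2.0 * PI * k / NG;
    gx[k] = xc + R * std::cos(th);
    gy[k] = yc + R * std::sin(th);
  }

  // Signed distance: sign from the analytic phi, magnitude from the closest sample.
  std::vector<double> phi0(n * n);
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i) {
      double x = i * h, y = j * h, dmin = 1e30;
      for (int k = 0; k < NG; ++k) {              // O(N_Gamma) scan per node
        double d = std::hypot(x - gx[k], y - gy[k]);
        if (d < dmin) dmin = d;
      }
      double inside = std::hypot(x - xc, y - yc) - R;   // analytic phi, used only for the sign
      phi0[i + n * j] = (inside < 0.0) ? -dmin : dmin;
    }
  std::printf("phi0 near the center = %g (expected about -R = -0.2)\n",
              phi0[50 + n * 50]);
  return 0;
}
```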
• 39. Renormalization (cont.)
- The Fast Marching algorithm proposed by Sethian (Proc Nat Acad Sci 93(4):1591-1595, 1996) is a fast (near-optimal) algorithm based on Dijkstra's algorithm for computing minimum distances in graphs from a source set.
- (Note: the original Dijkstra algorithm is $O(N^2)$, not fast. The fast version using a priority queue is due to Fredman and Tarjan (Journal of the ACM 34(3):596-615, 1987), and its complexity is $O(N\log|Q|) \approx O(N\log N)$.)
[Figure: level set, advancing front Q, and far-away nodes F.]
• 40. The Fast Marching algorithm
We explain it for the positive part $\phi \ge 0$; the algorithm is then reversed for $\phi \le 0$. All nodes are in one of: Q = advancing front, F = far-away, I = frozen/inactive. The advancing front sweeps the domain starting at the level set and converts F nodes to I nodes.
- Initially Q = {nodes that are in contact with the level set}. Their distance to the interface is computed for each cut cell. The rest of the nodes are in F (far-away).
- loop: take the node X in Q closest to the interface and move it from Q to I.
- Update the distances of all neighbors of X and move them from F to Q.
- Go to loop. The algorithm ends when Q = ∅.
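A minimal sketch of this advancing-front loop on a 2-D grid. Two simplifications are assumptions of the sketch, not of the real code: the distance update is the plain graph (Dijkstra) update d = d_X + h instead of the upwind Eikonal solve, and the front is kept in std::priority_queue with lazy deletion of stale entries instead of an updatable heap.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

int main() {
  const int n = 64; const double h = 1.0 / n, FAR = 1e30;
  std::vector<double> d(n * n, FAR);            // distance field (F = far-away)
  std::vector<char> frozen(n * n, 0);           // I = frozen/inactive
  auto id = [&](int i, int j) { return i + n * j; };

  // Min-heap of (distance, node): the advancing front Q.
  using Entry = std::pair<double, int>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> Q;

  // Seed: here a single source node, standing in for the cut-cell initialization.
  int src = id(n / 2, n / 2);
  d[src] = 0.0; Q.push({0.0, src});

  const int di[4] = {1, -1, 0, 0}, dj[4] = {0, 0, 1, -1};
  while (!Q.empty()) {
    auto [dist, X] = Q.top(); Q.pop();
    if (frozen[X] || dist > d[X]) continue;     // stale (lazily deleted) entry
    frozen[X] = 1;                              // move X from Q to I
    int i = X % n, j = X / n;
    for (int k = 0; k < 4; ++k) {               // update the neighbors of X
      int ii = i + di[k], jj = j + dj[k];
      if (ii < 0 || ii >= n || jj < 0 || jj >= n) continue;
      int Y = id(ii, jj);
      double dnew = d[X] + h;                   // simplified (graph-distance) update
      if (!frozen[Y] && dnew < d[Y]) { d[Y] = dnew; Q.push({dnew, Y}); }
    }
  }
  std::printf("d at a corner = %g (Manhattan distance from the source)\n", d[id(0, 0)]);
  return 0;
}
```

The real Fast Marching update solves the upwind discretization of |∇φ| = 1 using the already frozen neighbors, which yields the Euclidean rather than the Manhattan distance; the loop structure is the same.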
• 41. FastMarch: error and regularity of the distance function
- A numerical example shows the regularity of the computed distance function on a 100x100 mesh. The level set is a circle of R = 0.2 inside a square of L = 1.
- $\phi$ is shown along the x = 0.6 cut of the geometry, together with its first and second derivatives. $\phi$ deviates less than $10^{-3}$ from the analytical distance. Small spikes are observed in the second derivative. The error $|\phi - \phi_{ex}|$ shows the discontinuity of the slope at the level set.
[Figure: $\phi$, $d\phi/dx$, $d^2\phi/dx^2$ and $|\phi-\phi_{ex}|$ along the cut x = 0.6.]
• 42. FastMarch: implementation details
- The complexity is O(N) times the cost of finding the node in Q closest to the level set. This can be implemented very efficiently with a priority queue built on top of a heap; finding the closest node is then $O(\log|Q|)$. So the total cost is $O(N\log|Q|) \approx O(N\log(N^{(n_d-1)/n_d})) = O(N\log N^{2/3})$ in 3D.
- The standard C++ class priority_queue is not appropriate because it does not give access to the elements stored in the queue (their keys cannot be updated).
- We implemented the heap structure on top of a vector plus an unordered_map (hash-table based) that tracks the positions of the Q-nodes in the heap. The hash function used is very simple.
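A generic sketch of such an updatable priority queue: a binary min-heap stored in a vector, plus an unordered_map giving the current position of each node in the heap so that its key (tentative distance) can be decreased in O(log |Q|). This illustrates the idea only and is not the authors' actual class.

```cpp
#include <cstdio>
#include <unordered_map>
#include <utility>
#include <vector>

struct IndexedMinHeap {
  std::vector<std::pair<double, int>> h;   // (key, node id)
  std::unordered_map<int, int> pos;        // node id -> index in h

  void swapAt(int a, int b) {
    std::swap(h[a], h[b]);
    pos[h[a].second] = a; pos[h[b].second] = b;
  }
  void up(int i) {
    while (i > 0 && h[i] < h[(i - 1) / 2]) { swapAt(i, (i - 1) / 2); i = (i - 1) / 2; }
  }
  void down(int i) {
    for (;;) {
      int l = 2 * i + 1, r = l + 1, m = i;
      if (l < (int)h.size() && h[l] < h[m]) m = l;
      if (r < (int)h.size() && h[r] < h[m]) m = r;
      if (m == i) return;
      swapAt(i, m); i = m;
    }
  }
  void push(double key, int node) {
    h.push_back({key, node}); pos[node] = (int)h.size() - 1; up((int)h.size() - 1);
  }
  void decreaseKey(int node, double key) {       // a shorter distance was found
    int i = pos.at(node);
    if (key < h[i].first) { h[i].first = key; up(i); }
  }
  std::pair<double, int> popMin() {
    auto top = h[0];
    swapAt(0, (int)h.size() - 1);
    pos.erase(top.second); h.pop_back();
    if (!h.empty()) down(0);
    return top;
  }
  bool empty() const { return h.empty(); }
};

int main() {
  IndexedMinHeap Q;
  Q.push(0.7, 10); Q.push(0.3, 11); Q.push(0.9, 12);
  Q.decreaseKey(12, 0.1);
  while (!Q.empty()) {
    auto [key, node] = Q.popMin();
    std::printf("node %d with key %g\n", node, key);   // 12, 11, 10 in increasing key order
  }
  return 0;
}
```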
• 43. FastMarch renorm: efficiency
- The Fast Marching algorithm is $O(N\log|Q|)$, where N is the number of cells and |Q| the size of the advancing front. Rates were evaluated on an Intel i7-950@3.07GHz (Nehalem).
- The computing rate is practically constant and even decreases for high N. Since the rate of the NS-FVM algorithm is ≈100 [Mcell/s], renormalizing more often than about once every 200 steps would be too expensive.
- The cost of the renormalization step is reduced with band renormalization and parallelism (SMP).
[Figure: processing rate [Mcell/s] vs. number of cells, for meshes from 32x32 up to 2048x2048.]
• 44. FastMarch renorm: band renormalization
- The renormalization algorithm does not need to cover the whole domain; only a band around the level set (interface) is needed. The algorithm is modified simply by setting the distance at far-away nodes to $d = d_{max}$.
- The cost is proportional to the volume of the band, i.e. $V_{band} = S_{band}\,2 d_{max} \propto d_{max}$.
- A low $d_{max}$ reduces the cost, but increases the probability of forcing a new renormalization, thus increasing the renormalization frequency.
[Figure: level-set function renormalized with and without the $d_{max}$ cutoff; the renormalization band has width $2 d_{max}$.]
• 45. FastMarch renorm: parallelization
How can FastMarch be parallelized? We can use speculative parallelism: while processing a node X at the top of the heap, we can process in parallel the following node Y, speculating that most of the time Y will be far from X and can then be processed independently. This can be checked afterwards, using time-stamps for instance.
[Figure: advancing front around the level set; all nodes of the same color can be processed at the same time.]
• 46. FastMarch renorm: parallelization (cont.)
How many nodes can be processed concurrently? It turns out that the simultaneity (the number of nodes that can be processed simultaneously) grows linearly with refinement.
- Average simultaneity: 16x16: 11.358; 32x32: 20.507.
- Percentage of times the simultaneity is at least 4: 16x16: 93.0%; 32x32: 98.0%.
[Figure: number of independent points vs. position along the advancing front, and the corresponding histogram.]
• 47. FastMarching: computational budget
- With band renormalization and SMP parallelization we expect a rate of ≈20 Mcell/s. That means that a 128^3 mesh (2 Mcell) can be renormalized in ≈100 ms. This is 7x the time required for one time step (14 ms). Renormalization is amortized if it is performed no more often than about once every 20 time steps.
- Transfer of the data to and from the processor through the PCI Express 2.0 x16 channel (4 GB/s transfer rate) is on the order of 10 ms. Note that transfers to/from the card are amortized only if they are performed about once every 10 steps or so; such transfers cannot be done every time step.
• 48. Fluid structure interaction
- In the previous examples we have shown several cases of rigid solid bodies moving immersed in a fluid. In those cases the position of the body is prescribed beforehand. A more complex case is when the position of the body results from the interaction with the fluid.
- In general, if the body is not rigid we have to solve the elasticity equations in the body domain $\Omega_{bdy}(t)$. We assume here that the body is rigid, so that we just solve for its 6 d.o.f. with the linear and angular momentum equations of the body.
• 49. Fluid structure in the real world
- Three Gorges dam (PR China). Francis turbine. Belo Monte Francis turbine installation.
- Sayano-Shushenskaya dam (Russia): machine room before and after the 2009 accident.
- IMPSA (Industria Metalúrgica Pescarmona) is manufacturing three Francis turbines for the Belo Monte (Brazil) dam. Once in operation it will be the 3rd largest dam in the world.
- Ignition of a rocket engine (launch video video-simulacion-motor-cohete).
• 50. Fluid structure in the real world (cont.)
- CIMEC computed the rotordynamic coefficients, that is, a reduced-parameter representation of the fluid in the gap between stator and rotor. (launch video movecyl-tadpoles)
- The rotordynamic coefficients represent the forces of the fluid on the rotor as the rotor orbits around its centered position. They are used as a component of the design of the structural part of the turbine. They represent the added mass, damping, and stiffness (a.k.a. Lomakin effect).
- The geometry is very complex: a circular corona (annulus) of diameter 8 m, gap 3 mm, and height 2 m.
• 51. A buoyant fully submerged sphere
- The fluid is water, $\rho_{fluid}$ = 1000 [kg/m^3]. The body is a smooth sphere, D = 10 cm, fully submerged in a box of L = 39 cm. The solid density is $\rho_{bdy}$ = 366 [kg/m^3] ≈ 0.4 $\rho_{fluid}$, so the body has positive buoyancy.
- The buoy (sphere) is tied by a string to an anchor point at the center of the bottom side of the box. The box is subject to horizontal harmonic oscillations $x_{box} = A_{box}\sin(\omega t)$.
• 52. Spring-mass-damper oscillator
- The acceleration of the box creates a driving force on the sphere. Note that, as the sphere is positively buoyant, the driving force is in phase with the displacement: if the box accelerates to the right, the sphere tends to move in the same direction. (launch video p2-f09)
- The buoyancy of the sphere produces a restoring force that tends to keep the string aligned with the vertical direction. The mass of the sphere plus the added mass of the liquid gives inertia. The drag of the fluid gives a damping effect. So the system can be approximated as a (nonlinear) spring-mass-damper oscillator:
  $\underbrace{m_{tot}\,\ddot{x}}_{\text{inertia}} + \underbrace{\tfrac12 \rho_{fluid} C_d A\,|\dot x|\,\dot x}_{\text{damping}} + \underbrace{(m_{flot}\, g/L)\, x}_{\text{spring}} = \underbrace{m_{flot}\, a_{box}}_{\text{driving force}},$
  with $m_{tot} = m_{bdy} + m_{add}$, $m_{flot} = m_d - m_{bdy}$, $m_{add} = C_{madd}\, m_d$, $C_{madd}\approx 0.5$, $m_d = \rho_{fluid} V_{bdy}$.
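A minimal sketch that integrates this oscillator model with classical RK4. The sphere data follow the talk (D = 10 cm, rho_bdy = 366 kg/m^3, C_madd = 0.55, C_d = 0.3, box amplitude 2 cm) and reproduce the masses quoted on slide 56; the string length L, the forcing frequency and the time step are assumed values for illustration only.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
  const double PI = std::acos(-1.0), g = 9.81;
  const double rho_f = 1000.0, rho_b = 366.0, D = 0.10;
  const double V = PI * D * D * D / 6.0, Aproj = PI * D * D / 4.0;
  const double m_d = rho_f * V, m_bdy = rho_b * V;
  const double Cmadd = 0.55, Cd = 0.3;
  const double m_tot  = m_bdy + Cmadd * m_d;      // body mass + added mass
  const double m_flot = m_d - m_bdy;              // buoyancy excess mass
  const double L = 0.15;                          // string length [m] (assumed)
  const double Abox = 0.02, f = 0.9, w = 2.0 * PI * f;   // box motion (frequency assumed)

  // Acceleration of the sphere for state (x, v), i.e. the oscillator written above.
  auto acc = [&](double t, double x, double v) {
    double a_box = -Abox * w * w * std::sin(w * t);
    double drag  = 0.5 * rho_f * Cd * Aproj * std::fabs(v) * v;
    return (m_flot * a_box - drag - (m_flot * g / L) * x) / m_tot;
  };

  double t = 0.0, x = 0.0, v = 0.0, dt = 1e-3, xmax = 0.0;
  for (int n = 0; n < 60000; ++n) {               // 60 s, enough to reach steady state
    // classical 4th-order Runge-Kutta step for the system x' = v, v' = acc
    double k1x = v,                  k1v = acc(t, x, v);
    double k2x = v + 0.5 * dt * k1v, k2v = acc(t + 0.5 * dt, x + 0.5 * dt * k1x, v + 0.5 * dt * k1v);
    double k3x = v + 0.5 * dt * k2v, k3v = acc(t + 0.5 * dt, x + 0.5 * dt * k2x, v + 0.5 * dt * k2v);
    double k4x = v + dt * k3v,       k4v = acc(t + dt, x + dt * k3x, v + dt * k3v);
    x += dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x);
    v += dt / 6.0 * (k1v + 2 * k2v + 2 * k3v + k4v);
    t += dt;
    if (t > 40.0) xmax = std::max(xmax, std::fabs(x));  // amplitude once transients die out
  }
  std::printf("steady-state amplitude at f = %.2f Hz: %.4f m\n", f, xmax);
  return 0;
}
```

Sweeping f over the 0.5-1.5 Hz range with this script produces an amplitude-frequency curve of the same type as the experimental one shown on the next slides.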
• 53. Experimental results
- Experiments carried out at Univ Santiago de Chile (Dra. Marcela Cruchaga). A plexiglas box mounted on a shake table is subject to oscillations of 2 cm amplitude at frequencies of 0.5-1.5 Hz.
- The sphere position is obtained using Motion Capture (MoCap) from a high-speed (HS) camera (800x800 pixels @ 120 fps). The HS camera is an AOS Q-PRI; it can do 3 Mpixel @ 500 fps, and its maximum frame rate is 100,000 fps.
- MoCap is performed with a home-made code that finds the current position by minimization of a functional; it can also detect rotations. In this experimental setup 1 pixel = 0.5 mm (D = 200 px). Camera distortion and refraction have been evaluated to be negligible. The MoCap resolution can be less than one pixel.
(launch video mocap2)
• 54. Experimental results (cont.)
- The best fit of the experimental results with the simple spring-mass-damper oscillator is obtained for $C_{madd}$ = 0.55 and $C_d$ = 0.3. Reference values are: $C_{madd}$ = 0.5 (potential flow), $C_{madd}$ = 0.58 (Neill et al., Amer J Phys, 2013), $C_D$ = 0.47 (reference value for a sphere).
[Figure: amplitude [m] and phase [deg] vs. frequency [Hz]: experiment vs. the analytical model with CD=0.3, Cmadd=0.55.]
(launch video description-buoy-resoviz)
• 55. Computational FSI model
- The fluid-structure interaction is solved with a partitioned Fluid Structure Interaction (FSI) technique. The fluid is solved over the step $[t^n, t^{n+1} = t^n + \Delta t]$ between the known position $x^n$ of the body at $t^n$ and an extrapolated (predicted) position $x^{n+1,P}$ at $t^{n+1}$. Then the fluid forces (pressure and viscous tractions) are computed, and the rigid-body equations are solved to compute the true position of the solid $x^{n+1}$.
- The scheme is second-order accurate. It can be written as a fixed-point iteration and iterated to convergence (enhancing stability).
- The rigid-body equations are now
  $m_{bdy}\,\ddot{x} + (m_{flot}\, g/L)\, x = F_{fluid} + m_{flot}\, a_{box},$
  where $F_{fluid} = F_{drag} + F_{add}$ is now computed with the CFD (NS-FVM) code.
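A minimal sketch of this partitioned loop. The call that should advance the CFD solver over [t^n, t^{n+1}] is replaced by the analytic drag + added-mass model of slide 52 so that the example is self-contained; the under-relaxation factor, the string length L and the forcing frequency are illustrative assumptions, and the time integrator is a plain backward Euler rather than the second-order scheme of the actual code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
  const double PI = std::acos(-1.0), g = 9.81;
  const double rho_f = 1000.0, D = 0.10, V = PI * D * D * D / 6.0, A = PI * D * D / 4.0;
  const double m_d = rho_f * V, m_bdy = 366.0 * V, m_flot = m_d - m_bdy;
  const double Cd = 0.3, Cmadd = 0.55, L = 0.15;          // L assumed
  const double Abox = 0.02, w = 2.0 * PI * 0.9;           // box motion (frequency assumed)

  // Stand-in for "solve the fluid with the body moving as guessed":
  // returns the fluid force on the body (drag + added-mass reaction).
  auto fluid_force = [&](double v, double a_body) {
    return -0.5 * rho_f * Cd * A * std::fabs(v) * v - Cmadd * m_d * a_body;
  };

  double t = 0.0, dt = 1e-3, x = 0.0, v = 0.0, a = 0.0, xmax = 0.0;
  for (int n = 0; n < 60000; ++n) {
    double tn1 = t + dt;
    double a_box = -Abox * w * w * std::sin(w * tn1);
    // Partitioned step: guess the body motion, call the "fluid", correct the body, repeat.
    double ak = a;                                  // predictor: previous acceleration
    for (int k = 0; k < 20; ++k) {
      double vk = v + dt * ak;                      // backward-Euler update of the body
      double xk = x + dt * vk;
      double F  = fluid_force(vk, ak);              // "fluid solve" for this guess
      double anew = (F + m_flot * a_box - (m_flot * g / L) * xk) / m_bdy;
      double theta = 0.4;                           // under-relaxation of the fixed point
      ak = theta * anew + (1.0 - theta) * ak;
    }
    a = ak; v += dt * a; x += dt * v; t = tn1;
    if (t > 40.0) xmax = std::max(xmax, std::fabs(x));
  }
  std::printf("steady-state amplitude: %.4f m\n", xmax);
  return 0;
}
```

The sub-iterations with under-relaxation are what the slide calls iterating the fixed point to convergence: with an added mass larger than the body mass the plain fixed point diverges, which is the usual added-mass difficulty of partitioned schemes for light bodies.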
• 56. Why use a positively buoyant body?
- Governing equation for the body:
  $m_{bdy}\,\ddot{x} + (m_{flot}\, g/L)\, x = \underbrace{F_{fluid}(\{x\})}_{\text{computed with CFD}} + m_{flot}\, a_{box},$
  where $F_{fluid} = F_{drag} + F_{add}$ are the fluid forces computed with the CFD (NS-FVM) code.
- The whole methodology (computing the response curves of amplitude vs. frequency and comparing with experiments) is a means to validate the precision of the numerical model. The most important and hardest part, from the point of view of the numerical method, is to compute the fluid forces (drag and added mass). Buoyancy is almost trivial and can easily be captured well or factored out of the simulation.
- If the body is too heavy it is barely affected by the fluid forces, so we prefer a body as light as possible. In the present example $m_{bdy}$ = 0.192 [kg], $m_d$ = 0.523 [kg], $m_{add}$ = 0.288 [kg].
• 57. Numerical results
- Results with the NS-FVM GPGPU code on a GTX-580 (3 GB RAM): 6 Mcell, rate ≈140 Mcell/sec. (launch video fsi-boya-gpu-vorticity), (launch video gpu-corte), (launch video particles-buoyp5)
- As a reference, PETSc-FEM results are also included: boundary-fitted mesh of 400 k elements. (launch video fem-mesh-quality)
[Figures: vorticity isosurface; velocity at the symmetry plane; tadpoles; FEM mesh distortion.]
• 58. Comparison of experimental and numerical results
[Figure: amplitude [m] and phase [deg] vs. frequency [Hz]: experiment, analytical model (CD=0.3, Cmadd=0.55), NSFVM/GPU and FEM.]
• 59. Conclusions
- The NS-FVM implementation reaches high computing rates on GPGPU hardware (O(140 Mcell/s)). It can represent complex moving bodies without meshing. The surface representation of the bodies can be made second order (not implemented yet).
- Solution of the Poisson problem is currently a significant part of the computing time. This is reduced by using the AGP preconditioner and the MOC+BFECC combination.
- MOC+BFECC has lower computing rates than QUICK (4x slower) but may reach CFL=5 (versus CFL=0.5 for QUICK), so we get a speedup of 2.5x. Speedups may be higher if lower tolerances are required for the Poisson stage (more Poisson iterations).
- The fluid-structure interaction is solved with a partitioned strategy and is validated against experimental and numerical (FEM) results.
• 60. Acknowledgments
This work has received financial support from Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET, Argentina, PIP 5271/05), Universidad Nacional del Litoral (UNL, Argentina, grant CAI+D 2009-65/334), Agencia Nacional de Promoción Científica y Tecnológica (ANPCyT, Argentina, grants PICT-1506/2006, PICT-1141/2007, PICT-0270/2008), and the European Research Council (ERC) Advanced Grant "Real Time Computational Mechanics Techniques for Multi-Fluid Problems" (REALTIME, Reference: ERC-2009-AdG, Dir: Dr. Sergio Idelsohn).
The authors made extensive use of Free Software such as the GNU/Linux OS, the GCC/G++ compilers and Octave, and of Open Source software such as VTK, among many others. In addition, many ideas from these packages have been inspiring to them.