Presented at the 15th International Conference on Numerical Combustion in Avignon, France (19–22 April 2015).
Combustion simulations with detailed chemical kinetics require the integration of a large number of ordinary differential equations (ODEs), with at least one ODE system per spatial location solved every time step. This task is well-suited to the massively parallel processing capabilities of graphics processing units (GPUs), where individual GPU threads concurrently integrate independent ODE systems for different spatial locations. However, the typical high-order implicit algorithms used in combustion modeling applications (e.g., VODE, LSODE) to handle stiffness involve complex logical flow that causes severe thread divergence when implemented on GPUs, thus limiting performance. Alternative algorithms are therefore needed. This talk will discuss strategies and results using integration algorithms for nonstiff and stiff chemical kinetics on GPUs.
Using GPUs to accelerate nonstiff and stiff chemical kinetics in combustion simulations
1. Using GPUs to accelerate nonstiff and stiff chemical kinetics in combustion simulations
Kyle Niemeyer
School of Mechanical, Industrial, and Manufacturing Engineering
Oregon State University
20 April 2015
2. Benefit of GPU computing (for combustion)
Two avenues:
• Exascale science on supercomputers
• High-fidelity engineering on workstations
https://www.olcf.ornl.gov/titan/
3. Challenges of GPU computing (for combustion)
Two challenges:
• Design algorithms/strategies to reduce computational expense
• Identify appropriate algorithms for equal or better performance
4. GPU
• Graphics Processing Unit
• Developed to process & display 1000s of pixels
• Prioritizes throughput over latency → massive parallelism
[Image: GPU technical specification sheet — 9.75″ PCIe x16 form factor, 448 CUDA cores]
5. Modern GPU hardware architecture
Streaming multiprocessor (SM)
[Figure: Fermi-class GPU hardware — the full GPU with up to 16 streaming multiprocessors (left) and a single SM (right)]
Brodtkorb AR, Hagen TR, Sætra ML. J Parallel Distrib Comput 2013;73:4–13.
6. Using GPUs
• Parallel function: “kernel”
• Hundreds to millions of concurrent threads
• Executed in 32-thread “warps”
• Challenge: thread divergence
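To see why divergence matters for chemistry: threads in a warp execute in lockstep, so when an adaptive integrator needs different step counts in different cells, the whole warp runs as long as its slowest thread. A toy cost model (my illustration; the 10-step and 200-step workloads are made up):

```python
# Toy cost model of SIMT warp divergence (illustration only, not real GPU
# timing). Threads in a warp execute in lockstep: threads that finish their
# integration early sit masked while the warp runs until its slowest
# thread completes.
WARP_SIZE = 32

def warp_cost(steps_per_thread):
    """Lockstep cost: the warp iterates until every thread is done."""
    return max(steps_per_thread)

def utilization(steps_per_thread):
    """Fraction of executed lanes doing useful work."""
    total = warp_cost(steps_per_thread) * len(steps_per_thread)
    return sum(steps_per_thread) / total

# One stiff cell (200 substeps) among 31 easy ones (10 substeps each):
steps = [10] * 31 + [200]
print(warp_cost(steps), f"{utilization(steps):.0%}")  # → 200 8%
```

A single slow cell makes the warp 20× slower and drops lane utilization to about 8%, which is why branch-heavy implicit solvers map poorly onto warps.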
10. • Large number of independent ODE systems to solve
• Can be even more for turbulent combustion!
$$\frac{dY_i}{dt} = \frac{\dot{\omega}_i W_i}{\rho},
\qquad
\begin{pmatrix} \frac{dY_1}{dt} \\ \frac{dY_2}{dt} \\ \vdots \\ \frac{dY_k}{dt} \end{pmatrix}$$
(Y_i: species mass fraction; W_i: molecular weight; ω̇_i: molar production rate; ρ: density)
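As a sketch, the right-hand side of the species equations for a single spatial location might look like the following (the two-species numbers are illustrative only, not from the talk; in a real code the production rates would come from evaluating the reaction mechanism at the current state):

```python
import numpy as np

def species_rhs(omega_dot, W, rho):
    """dY_i/dt = omega_dot_i * W_i / rho for each species i.
    omega_dot: molar production rates [kmol/m^3/s] (from the mechanism),
    W: molecular weights [kg/kmol], rho: mixture density [kg/m^3]."""
    return omega_dot * W / rho

# Hypothetical two-species example (H2 and O2; rate values are made up):
W = np.array([2.016, 31.998])
rho = 0.5
omega_dot = np.array([-1.0e-3, -0.5e-3])
print(species_rhs(omega_dot, W, rho))
```

Each grid point carries its own independent system of this form, which is what makes one-thread-per-cell GPU integration natural.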
17. Runge–Kutta–Cash–Karp (RKCK)
• Fifth-order accuracy
• Adaptive time stepping
• Global time step: 1×10⁻⁸ s
• “Nonstiff” hydrogen mechanism¹
• Number of ODE systems ranged from 10 to 10⁷
¹Yetter, Dryer, and Rabitz, CST 79 (1991) 97–128
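A minimal scalar sketch of the Cash–Karp embedded pair (coefficients as tabulated in Numerical Recipes; a GPU implementation would run one such integrator per thread, one ODE system per spatial location):

```python
import math

# Cash-Karp embedded Runge-Kutta coefficients.
A = [
    [],
    [1/5],
    [3/40, 9/40],
    [3/10, -9/10, 6/5],
    [-11/54, 5/2, -70/27, 35/27],
    [1631/55296, 175/512, 575/13824, 44275/110592, 253/4096],
]
C = [0, 1/5, 3/10, 3/5, 1, 7/8]
B5 = [37/378, 0, 250/621, 125/594, 0, 512/1771]                 # 5th order
B4 = [2825/27648, 0, 18575/48384, 13525/55296, 277/14336, 1/4]  # 4th order

def rkck_step(f, t, y, h):
    """One Cash-Karp step for a scalar ODE y' = f(t, y).
    Returns the 5th-order solution and the embedded error estimate."""
    k = []
    for i in range(6):
        yi = y + h * sum(a * kj for a, kj in zip(A[i], k))
        k.append(f(t + C[i] * h, yi))
    y5 = y + h * sum(b * kj for b, kj in zip(B5, k))
    y4 = y + h * sum(b * kj for b, kj in zip(B4, k))
    return y5, abs(y5 - y4)

def integrate(f, t, y, t_end, tol=1e-8):
    """Adaptive integration: grow/shrink h from the error estimate."""
    h = (t_end - t) / 100
    while t < t_end:
        h = min(h, t_end - t)
        y_new, err = rkck_step(f, t, y, h)
        if err <= tol:                      # accept the step
            t, y = t + h, y_new
        # err ~ h^5, so rescale h by (tol/err)^(1/5), with safety limits
        h *= min(5.0, max(0.1, 0.9 * (tol / max(err, 1e-16)) ** 0.2))
    return y

y = integrate(lambda t, y: -y, 0.0, 1.0, 1.0)
print(y, math.exp(-1))  # should agree to roughly the tolerance
```

The branch structure here (accept/reject, variable step counts) is exactly what diverges across threads when stiffness varies spatially.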
25. C2H4 – RKC vs. VODE (stiff)
2.5×
Time step: 1×10⁻⁴ s
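For context on why stabilized explicit methods such as RKC can handle moderate stiffness: their real-axis stability interval grows quadratically with the number of stages s. The snippet below checks this for the simplest, first-order Chebyshev method, whose stability polynomial is T_s(1 + z/s²); this is a simplification of the second-order RKC method actually used, shown for illustration only:

```python
import math

def cheb(s, x):
    """Chebyshev polynomial T_s(x), valid for all real x."""
    if abs(x) <= 1:
        return math.cos(s * math.acos(x))
    # Outside [-1, 1], T_s grows like cosh; sign follows parity for x < -1.
    return math.cosh(s * math.acosh(abs(x))) * (1 if x > 0 else (-1) ** s)

def stable(s, z):
    """First-order Chebyshev stability: |T_s(1 + z/s^2)| <= 1."""
    return abs(cheb(s, 1 + z / s**2)) <= 1 + 1e-12

s = 10
# The real-axis stability boundary sits at z = -2 s^2, i.e. 2s times
# wider than the s forward-Euler stages alone would allow.
print(stable(s, -2 * s**2), stable(s, -2 * s**2 - 1))  # → True False
```

This quadratic stability growth is what lets a cheap, divergence-friendly explicit recurrence take stiff-scale time steps on a GPU.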
26. Takeaways
• For exascale (i.e., DNS):
- Explicit algorithms significantly faster on GPUs
• For high-fidelity engineering (i.e., LES):
- Implicit algorithms perform comparably to CPU so far… (but not much better)
- Stabilized explicit algorithms offer an attractive alternative
- Greater stiffness still a problem
More details: see Niemeyer KE, Sung CJ. J Comput Phys 2014;256:854–871.