ORNL is managed by UT-Battelle
for the US Department of Energy
Towards Exascale
Simulations of Stellar
Explosions with FLASH
J. Austin Harris
Scientific Computing Group
Oak Ridge National Laboratory
Collaborators:
Bronson Messer (ORNL)
Tom Papatheodore (ORNL)
Acknowledgements
• Preparing codes to run on the upcoming (CORAL) Summit supercomputer at ORNL
• Summit – IBM POWER9 + NVIDIA Volta
• EA System – IBM POWER8 + NVIDIA Pascal
FLASH – adaptive-mesh, multi-physics simulation code widely used
in astrophysics
http://flash.uchicago.edu/site/
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National
Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No.
DE-AC05-00OR22725.
Supernovae
• Brightness rivals that of the host galaxy
• Primarily two physical mechanisms:
– Core-collapse supernova (gravity-induced explosion of massive star)
– Thermonuclear supernova (accretion-induced explosion of white dwarf)
[Images: Cassiopeia A (SN 1680), Tycho (SN 1572), “The Crab” (SN 1054), Kepler (SN 1604)]
FLASH code
• FLASH is a publicly available, component-based, MPI+OpenMP
parallel, adaptive mesh refinement (AMR) code that has been
used on a variety of parallel platforms.
• The code has been used to simulate a variety of phenomena,
including
– thermonuclear and core-collapse supernovae,
– galaxy cluster formation,
– classical novae,
– formation of proto-planetary disks, and
– high-energy-density physics.
• FLASH’s multi-physics and AMR capabilities make it an ideal
numerical laboratory for studying nucleosynthesis in supernovae.
• Targeted for CAAR:
– Nuclear kinetics (burn unit) --- GPU-enabled libraries
– Equation of State (EOS) --- OpenACC
– Hydrodynamics and Gravity module performance
Nuclear kinetics
• Nuclear composition evolved at each time-step
– Linearized (implicit) solution of a coupled set of stiff ODEs
– FLOPs ~ n³ for a network of n species (a worked estimate follows this list)
• Accurate treatment important for nuclear energy
generation and determining final composition
• Computational constraints traditionally limit system to < 14 species
• FLASH-CAAR:
– Replace the small “hard-wired” reaction network in FLASH with a general-purpose
reaction network, and use GPU-enabled libraries to accelerate the ODE solution
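For a sense of scale, a back-of-the-envelope estimate (assuming the dense LU factorization's ~(2/3)n³ cost dominates each solve):

\frac{\mathrm{FLOPs}(n=150)}{\mathrm{FLOPs}(n=13)} \approx \left(\frac{150}{13}\right)^{3} \approx 1.5 \times 10^{3}

so a 150-species network costs roughly a thousand times more per solve than a 13-species one, hence the move to GPUs.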
Nuclear kinetics
XNet
• General-purpose thermonuclear reaction network written in modular Fortran 90

\dot{Y}_i = \sum_j \mathcal{N}^i_j \lambda_j Y_j + \sum_{j,k} \mathcal{N}^i_{j,k}\,\rho N_A \langle\sigma v\rangle_{j,k}\, Y_j Y_k + \sum_{j,k,l} \mathcal{N}^i_{j,k,l}\,\rho^2 N_A^2 \langle\sigma v\rangle_{j,k,l}\, Y_j Y_k Y_l
• Stiff system of ODEs
– Implicit solver (Backward Euler / Bader-Deuflhard / Gear):

\vec{F}(t+\Delta t) \equiv \frac{\vec{Y}(t+\Delta t) - \vec{Y}(t)}{\Delta t} - \dot{\vec{Y}}(t+\Delta t) = 0
• Being incorporated into a shared repository of microphysics developed for AMReX-based codes, including FLASH
– https://starkiller-astro.github.io/Microphysics/
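For reference, applying Newton–Raphson iteration to the zero-finding function F above gives (by the standard reduction, not spelled out on the slide) the dense linear system whose LU solve dominates the cost; these are the dgetrf/dgetrs calls on the next slide:

\left(\frac{\tilde{I}}{\Delta t} - \tilde{J}\right)\Delta\vec{Y} = \dot{\vec{Y}}(t+\Delta t) - \frac{\vec{Y}(t+\Delta t) - \vec{Y}(t)}{\Delta t}, \qquad \tilde{J} \equiv \frac{\partial\dot{\vec{Y}}}{\partial\vec{Y}}

with each iteration updating Y(t+Δt) by ΔY until convergence.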
XNet in FLASH
• FLASH burner restructured to operate on multiple zones at once, gathered from all local
AMR blocks, for XNet to evolve simultaneously (original and batched versions shown below)
Original (CPU, one linear system per zone):

!$omp parallel shared(…) private(…)
!$omp do
do k = 1, num_zones
  do j = 1, num_timesteps
    <build linear system>
    dgetrf(…)   ! LAPACK: LU-factorize the dense Jacobian
    dgetrs(…)   ! LAPACK: back-substitute for the Newton update
    <check convergence>
  end do
end do
!$omp end do
!$omp end parallel

Restructured (GPU, batched linear systems via cuBLAS):

!$omp parallel shared(…) private(…)
!$omp do
do k = 1, num_local_batches
  do j = 1, num_timesteps
    <CPU operations>
    !$acc parallel loop
    do ib = 1, nb
      <build ib’th linear system>
    end do
    !$acc end parallel loop
    cublasDgetrfBatched(…)   ! factorize all nb systems in one call
    cublasDgetrsBatched(…)   ! solve all nb systems in one call
    <send results to CPU>
    <check convergence>
  end do
end do
!$omp end do
!$omp end parallel
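The batching is the key design choice here: a single 150×150 linear solve is far too small to occupy a modern GPU by itself, so grouping the independent per-zone systems into one batched cuBLAS call amortizes launch overhead and exposes enough parallelism to fill the device.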
FLASH AMR
• Currently uses PARAMESH (MacNeice+, 2000)
• Moving to the ECP-supported AMReX framework (ExaStar ECP project)
FLASH AMR Optimization
• Problem: Computational load can be quite unevenly distributed across blocks
• Solution: Weight the Morton space-filling curve by the maximum number of
timesteps taken by any single cell in a block (a sketch of this weighting follows).
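A minimal sketch of the idea, using hypothetical names (FLASH's actual PARAMESH work-distribution interface differs): compute each block's weight as the largest burner subcycle count over its cells, then cut the Morton-ordered block list into segments of roughly equal total weight.

! Hypothetical sketch of weighting the Morton-ordered block list.
! nsub(c,b) = burner timesteps taken by cell c of block b (assumed input).
subroutine balance_blocks(nblocks, nranks, nsub, owner)
  implicit none
  integer, intent(in)  :: nblocks, nranks
  integer, intent(in)  :: nsub(:,:)
  integer, intent(out) :: owner(nblocks)   ! destination rank per block
  real    :: weight(nblocks), w_target, acc
  integer :: ib, irank

  ! The slowest cell sets the cost of the whole block
  do ib = 1, nblocks
     weight(ib) = real(maxval(nsub(:, ib)))
  end do
  w_target = sum(weight) / real(nranks)

  ! Sweep the space-filling curve, starting a new rank once the
  ! running weight reaches the per-rank target
  irank = 0
  acc   = 0.0
  do ib = 1, nblocks
     owner(ib) = irank
     acc = acc + weight(ib)
     if (acc >= w_target .and. irank < nranks - 1) then
        irank = irank + 1
        acc   = acc - w_target
     end if
  end do
end subroutine balance_blocks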
FLASH Performance w/ XNet
• Tests performed on a single Summit Phase I node
– 2 IBM POWER9 CPUs (22 cores each), 6 NVIDIA “Volta” V100 GPUs
• 1 3D block (16³ = 4096 zones) per rank per GPU, evolved for 20 FLASH timesteps
FLASH Early Summit Results
• CAAR work primarily concerned with increasing physical
fidelity by accelerating the nuclear burning module and
associated load balancing.
• Summit GPU performance fundamentally changes the
potential science impact by enabling large-network (i.e.
160 or more nuclear species) simulations.
– Heaviest elements in the Universe are made in
neutron-rich environments – small networks are
incapable of tracking these neutron-rich nuclei
– Opens up the possibility of producing precision
nucleosynthesis predictions to compare to
observations
– Provides detailed guidance on the astrophysically most important nuclear
reactions and masses to be measured at FRIB
[Figures: Crab Nebula image (NASA, ESA, J. Hester and A. Loll, Arizona St. Univ.) and a chart-of-nuclides abundance plot from nucastrodata.org (proton number vs. neutron number, elements n through Zn; abundance color scale 1.00E-25 to 2.35E-01) for the neutrino-p process in zone_01 at timestep 0 (time = 0.000E+00 s, density = 4.214E+06 g/cm³, T9 = 8.240), comparing network coverage: Aprox13, a 13-species α-chain, vs. X150, a 150-species network.]
• Time for the 160-species run (blue) on Summit is roughly equal to that of the
13-species “alpha” network run (red) on Titan: >100x the computation for identical cost
• Preliminary results on Summit: GPU+CPU vs. CPU-only performance for a
288-species network is 2.9x
– P9: 24.65 seconds/step
– P9 + Volta: 8.5 seconds/step
– A 288-species run is impossible on Titan
Equation of State
“Helmholtz EOS” (Timmes & Swesty, 2000)
• Provides closure for the thermodynamic system (e.g., P = P(ρ,T,X))
• Based on a Helmholtz free-energy formulation
– High-order interpolation from a table of free energy (quintic Hermite polynomials); see the relations sketched below
• OpenACC version developed by collaborators at Stony Brook University
– Part of a shared repository of microphysics (“starkiller”) developed for
AMReX-based codes, including FLASH
• FLASH traditionally only operates on vectors (i.e. rows from AMR blocks)
– Does this expose enough parallelism? No
• How many AMR blocks should we evaluate the EOS for simultaneously
per MPI rank?
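For context, these are the standard thermodynamic relations (general theory, not specific to this code) by which state quantities follow from the interpolated specific free energy F(ρ,T); this is why a high-order, thermodynamically consistent interpolant matters:

P = \rho^2 \left.\frac{\partial F}{\partial \rho}\right|_T, \qquad s = -\left.\frac{\partial F}{\partial T}\right|_\rho, \qquad e = F + T s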
Helmholtz EOS
• To determine the best use of the accelerated EOS in FLASH, we used a mini-app driver:
– Mimics the AMR block structure and time stepping in FLASH
• Loops through several time steps
• Change the number of total grid zones
• Fill these zones with new data
• Calculate interpolation in all grid zones
Helmholtz EOS
1) Allocate main data arrays (global) on host and device
– Arrays of Fortran derived types (a minimal sketch follows this list)
• Each element holds grid data for a single zone
– Persist for the duration of the program
– Used to pass zone data back and forth from host to device
• Reduced set sent from H-to-D
• Full set sent from D-to-H
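A minimal sketch of this layout, with hypothetical type and field names (the driver's actual types differ); the device copies are created once and persist, so each timestep only updates their contents:

! Hypothetical sketch of the persistent host/device zone arrays
module eos_data
  implicit none
  type :: eos_input_t            ! reduced per-zone inputs (H-to-D)
     real(8) :: rho, temp, abar, zbar
  end type eos_input_t
  type :: eos_state_t            ! full per-zone results (D-to-H)
     real(8) :: pres, ener, entr, cv, cp, gam
  end type eos_state_t
  type(eos_input_t), allocatable :: reduced_state(:)
  type(eos_state_t), allocatable :: state(:)
contains
  subroutine init_eos_data(max_zones)
    integer, intent(in) :: max_zones
    allocate(reduced_state(max_zones), state(max_zones))
    ! Create matching device copies that live for the whole run
    !$acc enter data create(reduced_state, state)
  end subroutine init_eos_data
end module eos_data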
Helmholtz EOS
2) Read in tabulated Helmholtz free energy data and make copy on
device
– This will persist for the duration of the program
– Thermodynamic quantities are interpolated from this table
3) For each time step
– Change number of AMR blocks
• ± 5%, consistent with variation encountered in production simulations at high rank count
– Update device with new grid data
– Launch EOS kernel: calculate all interpolated quantities for all grid zones
– Update host with newly calculated quantities
Helmholtz EOS
Basic Flow of Driver Program
OpenACC version:

!$acc update device(reduced_state(start_element:stop_element)) async(thread_id + 1)   ! H-to-D: reduced inputs
!$acc kernels async(thread_id + 1)
do zone = start_element, stop_element
  call eos(state(zone), reduced_state(zone))
end do
!$acc end kernels
!$acc update self(state(start_element:stop_element)) async(thread_id + 1)   ! D-to-H: full results
!$acc wait

OpenMP 4.5 version:

!$omp target update to(reduced_state(start_element:stop_element))   ! H-to-D: reduced inputs
!$omp target
!$omp teams distribute parallel do thread_limit(128) num_threads(128)
do zone = start_element, stop_element
  call eos(state(zone), reduced_state(zone))
end do
!$omp end teams distribute parallel do
!$omp end target
!$omp target update from(state(start_element:stop_element))   ! D-to-H: full results
EOS Experiments
• All experiments carried out on
SummitDev
– Nodes have 2 IBM Power8+ 10-core
CPUs
– peak flop rate of approximately 560 GF
– peak memory bandwidth of 340 GB/sec
• + 4 NVIDIA P100 GPUs
– peak single/double precision flop rate of
10.6/5.3 TF
– peak memory bandwidth of 732 GB/sec
• Number of “AMR” blocks: 1, 10, 100,
1000, 10000 (each with 256 zones)
– Each emulates a 2D block in FLASH
• Tested with 1, 2, 4, and 10 (CPU)
OpenMP threads for each block count
OpenACC vs OpenMP 4.5
• DISCLAIMER:
• At the time these tests were performed:
– PGI’s OpenACC implementation had a mature API (version 16.10)
– IBM’s XL Fortran implementation of OpenMP 4.5 was still maturing (version 16.1)
• Beta version of the compiler
• Did not allow pinned memory or asynchronous data transfers / kernel execution
Results
• For high numbers of AMR blocks, OpenACC is roughly 3x faster
• More complicated behavior at lower block counts
OpenACC at low block counts
• At low AMR block counts, kernel overhead (~0.1 ms) is large relative to
compute time, and increased work does little to increase total performance.
OpenACC kernel overheads continued
• Multiple CPU threads stagger H2D transfers, exacerbating the kernel-overhead delay
OpenACC at high block counts
• At higher block counts, kernel overhead is negligible; runtime is now
dominated by D2H transfers
OpenMP at low block counts
• There is no asynchronous GPU
execution, i.e. the work enqueued
by each CPU thread is serialized
on the device.
• Performance is correspondingly lower than with OpenACC’s asynchronous execution.
OpenMP at higher block counts
• Lack of asynchronous execution
becomes less important, as the
device compute capability is
saturated.
• D2H (and H2D) transfers are significantly slower than for OpenACC, since
here we lack the ability to pin CPU memory.
Optimal GPU configuration
• Clear advantage from GPUs when >100 AMR blocks in 2D (or >6 in 3D; see
the arithmetic below)
• Can calculate 100 2D blocks with the GPU in roughly the same time as 1
2D block without it
• So in FLASH, we should compute
100s to 1000s of 2D blocks per MPI
rank, depending on available memory
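The 3D threshold follows from a simple zone count (a quick consistency check against the block sizes quoted earlier): a 2D block holds 256 zones and a 3D block holds 16³ = 4096 zones, so

100 \times 256 = 25\,600 \;\text{zones} \quad\Rightarrow\quad \frac{25\,600}{4096} \approx 6.25 \;\text{3D blocks}

i.e. just over 6 3D blocks carry the same work as the ~100-block 2D crossover.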
EOS Summary
• OpenMP provides an effective path to performance portability, so despite the
lower performance here, we plan to test the OpenMP 4.5 implementation in
FLASH production.
– Primary factors affecting current OpenMP performance are the serialization of kernels on the
device and high data transfer times associated with having to use pageable memory when
using OpenMP 4.5. These are technical problems that are certainly surmountable.
• In general, we find that the best balance between CPU threads and block
number occurs at and above 2-4 CPU threads and roughly 1,000 2D blocks. We
can retire all 1,000 of these EOS evaluations in a time less than 10x the fastest
100-block calculation for both OpenACC and OpenMP.
– This mode is congruent with our planned production use of FLASH on the OLCF Summit
machine, where we will place 3 MPI ranks on each CPU socket, each bound to one of the
three available, closely coupled GPUs (one plausible launch line is sketched below).
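For illustration only, one plausible jsrun launch line for that layout (an assumption, not taken from the project's job scripts; the executable name flash4 and 7 cores per rank are illustrative): six resource sets per node, each with one MPI rank, seven cores, and one GPU:

jsrun --nrs 6 --rs_per_host 6 --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 ./flash4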
Conclusions
• Overall, very positive experience with Summit
– Some issues with parallel HDF5 under investigation
• With the upgrades to the nuclear burning and EOS in FLASH, we find
significant speedup (2x - 3x) relative to the CPU alone
• Still plenty of work to do!
– Hydrodynamics
– Gravity
– Radiation Transport