OpenACC Monthly Highlights Summer 2019


WHAT IS OPENACC?
main()
{
<serial code>
#pragma acc kernels
{
<parallel code>
}
}
Add Simple Compiler Directive
SIMPLE
Directives-based programming model for parallel computing
POWERFUL & PORTABLE
Designed for performance and portability on CPUs and GPUs
Open Specification Developed by OpenACC.org Consortium
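The directive pattern on this slide can be made concrete with a small loop. The following is a minimal sketch, not taken from the slides: the function name saxpy, its parameters, and the data clauses are illustrative assumptions. A compiler without OpenACC support ignores the pragma and runs the loop serially.

```c
#include <assert.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for i in [0, n).
 * The kernels directive asks the compiler to offload the loop;
 * without OpenACC support the pragma is ignored and the loop runs serially. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0);

    /* copyin: x is only read on the device; copy: y is read and written back. */
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

Calling saxpy(4, 2.0f, x, y) on small host arrays gives the same result with or without an accelerator, which is the portability point the slide makes.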

silica IFPEN, RMM-DIIS on P100
OPENACC GROWING MOMENTUM
Wide Adoption Across Key HPC Codes
ANSYS Fluent
Gaussian
VASP
LSDalton
MPAS
GAMERA
GTC
XGC
ACME
FLASH
COSMO
Numeca
200 APPS* USING OpenACC
“For VASP, OpenACC is the way forward for GPU
acceleration. Performance is similar to CUDA, and
OpenACC dramatically decreases GPU
development and maintenance efforts. We’re
excited to collaborate with NVIDIA and PGI as an
early adopter of Unified Memory.”
Prof. Georg Kresse
Computational Materials Physics
University of Vienna
VASP
Top Quantum Chemistry and Material Science Code
* Applications in production and development

DON’T MISS THESE UPCOMING EVENTS
Event                                                  Call Closes        Event Date
National Supercomputing Center in Shenzhen, China      July 1, 2019       August 12-16, 2019
University of Sheffield (UK)                           June 16, 2019      August 19-23, 2019
GPU Bootcamp at RIKEN (R-CCS)                          August 26, 2019    September 3, 2019
Nat’l Center for Supercomputing Applications (NCSA)    July 8, 2019       September 9-13, 2019
C-DAC, Pune, India                                     August 9, 2019     September 14-18, 2019
Brookhaven GPU Hackathon                               June 30, 2019      September 23-27, 2019
Swiss National Supercomputing Center (Switzerland)     July 7, 2019       September 30-October 4, 2019
Oak Ridge National Laboratory (OLCF)                   August 16, 2019    October 21-25, 2019

PGI 19.7 NOW AVAILABLE!
New features include support for:
● OpenACC Auto-compare: detect diverging results between
CPU and GPU or multiple CPU code versions.
● CUDA Fortran: 16-bit REAL(2) data type for V100 tensor
core operations, and optimized array assignment-based data
movement.
● PGI on AWS: develop, test, benchmark, and deploy on V100s for
as little as $3/hour.
● Additional Fortran 2008 features: g0 edit descriptor,
multiple sourced allocation, vector norm2, and several other
features.
● C++: now interoperable with GNU releases through GCC 9.1.
● LLVM 8.0: the default LLVM back-end for Linux on x86-64
and OpenPOWER is updated from LLVM 7.0 to 8.0.

ATTEND THIS SEMINAL EVENT:
2019 OPENACC ANNUAL MEETING
Hosted by RIKEN Center for Computational Science (RIKEN R-CCS)
in Kobe, Japan, the 2019 Annual Meeting will bring
together researchers and developers to discuss how to improve
the specification, help accelerate scientific efforts using the
OpenACC programming model, and grow the OpenACC
organization and community.
Agenda includes:
• Keynotes and invited talks from recognized experts
across multiple disciplines of science
• User feedback session
• GPU Bootcamp
• Networking event

2019 SPEAKERS: LUMINARIES ACROSS
ACADEMIA, RESEARCH AND INDUSTRY
Opening Remarks: Mitsuhisa Sato, Deputy Director, RIKEN Center for Computational Science (RIKEN R-CCS)
Keynote: Satoshi Matsuoka, Director, RIKEN Center for Computational Science (RIKEN R-CCS)
Keynote: Jack Wells, Director of Science, Oak Ridge Leadership Computing Facility (OLCF)
Organizations speaking:
• RIKEN Center for Computational Science (RIKEN R-CCS)
• Oak Ridge National Laboratory (ORNL)
• National Institutes for Quantum and Radiological Science and Technology (QST)
• Japan Atomic Energy Agency (JAEA)
• University of Tsukuba
• Indian Institute of Technology Bombay, Mumbai
• University of Tokyo
• Osaka University
• National Center for High-Performance Computing (NCHC), Taiwan

GET HANDS-ON WITH GPU BOOTCAMP
APPLICATION DEADLINE: AUGUST 23, 2019
GPU Bootcamp is an exciting and unique way for
scientists and researchers to learn the skills needed
to quickly start accelerating codes on GPUs.
This one-day event will introduce you to available
GPU libraries, programming models, and platforms.
You will learn the basics of GPU programming
through extensive hands-on collaboration on a
real-life code using the OpenACC programming
model.

CALL FOR PAPERS:
SIXTH WORKSHOP ON ACCELERATOR
PROGRAMMING USING DIRECTIVES
Co-located with SC19, WACCPD has been one of the
major forums for bringing together programming model
users, developers, and the tools community to share
knowledge and experiences in tackling emerging
complex parallel computing systems.
The workshop highlights the state of the art through
accepted papers, showcases all aspects of
heterogeneous systems, and discusses innovative
features, techniques, and lessons learned.
SUBMISSION DEADLINE: AUGUST 22, 2019

RESOURCES
Paper: pointerchain: Tracing pointers to their roots –
A case study in molecular dynamics simulations
Millad Ghane, Sunita Chandrasekaran, and Margaret S. Cheung
As scientific frameworks become sophisticated, so do their data structures. A data structure typically
includes pointers and arrays to other structures in order to preserve the application’s state. In order to ensure
data consistency from a scientific application on a modern high performance computing (HPC)
architecture, the management of such pointers on the host and the device has become complicated in
terms of memory allocations because they occupy separate memory spaces. It becomes so severe that
one must go through a chain of pointers to extract the effective address. In this paper, we propose to
reduce the need for excessive data transfer by introducing the idea of pointerchain, a directive that
replaces the pointer chains with their corresponding effective address inside the parallel region of a code.
Based on our analysis, pointerchain leads to a 39% and 38% reduction in the amount of generated
code and the total executed instructions, respectively.
With pointerchain, we have parallelized CoMD, a Molecular Dynamics (MD) proxy application on
heterogeneous HPC architectures while maintaining a single portable codebase. This portable codebase
utilizes OpenACC, an emerging directive-based programming model, to address the need of memory
allocations from three computational kernels in CoMD. Two of the three embarrassingly parallel kernels
highly benefit from OpenACC and perform better than the hand-written CUDA counterparts. The third
kernel reached 61% of the peak performance of its CUDA counterpart. The three kernels are common
modules in any MD simulation. Our findings provide useful insights into parallelizing legacy MD
software across heterogeneous platforms.
Fig. 1. An example of a pointer chain: an illustration of a data structure and its
children. To reach the position array, the processor must dereference a chain of
pointers to extract the effective address
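
To illustrate the pointer-chain problem described in the abstract, the sketch below resolves a chain by hand before the parallel region. It is not the paper's pointerchain directive, only the manual equivalent of what the abstract says the directive automates. The struct names (Simulation, Atoms) and the function scale_positions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical nested data structures, named for illustration only. */
typedef struct { double *position; size_t count; } Atoms;
typedef struct { Atoms *atoms; } Simulation;

/* Multiply every position entry by factor.
 * The chain sim->atoms->position is dereferenced once on the host, so the
 * parallel region only sees a flat array and its length. */
void scale_positions(Simulation *sim, double factor)
{
    assert(sim != NULL && sim->atoms != NULL);

    double *pos = sim->atoms->position;  /* effective address, extracted once */
    int n = (int)sim->atoms->count;

    #pragma acc parallel loop copy(pos[0:n])
    for (int i = 0; i < n; ++i) {
        pos[i] *= factor;
    }
}
```

Because only pos and n enter the region, the device needs no copy of the enclosing structures, which is the data-transfer saving the abstract refers to.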

RESOURCES
Paper: Hardware Acceleration of Reaction-Diffusion
Systems: A Guide to Optimisation of Pattern Formation
Algorithms Using OpenACC
Ruth E. Falconer, Alasdair N. Houston, Xavier Portell, and
Wilfred Otten
Reaction Diffusion Systems (RDS) have widespread applications in computational ecology,
biology, computer graphics and the visual arts. For the former applications a major barrier to
the development of effective simulation models is their computational complexity: it takes a
great deal of processing power to simulate enough replicates such that reliable conclusions
can be drawn. Optimizing the computation is thus highly desirable in order to obtain more
results with fewer resources. Existing optimizations of RDS tend to be low-level and GPGPU
based. Here we apply the higher-level OpenACC framework to two case studies: a simple
RDS to learn the ‘workings’ of OpenACC and a more realistic and complex example. Our
results show that simple parallelization directives and minimal data transfer can produce a
useful performance improvement. The relative simplicity of porting OpenACC code between
heterogeneous hardware is a key benefit to the scientific computing community in terms of
speed-up and portability.
Fig. 3. Patterns obtained from GSRD model.
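
The combination of "simple parallelization directives and minimal data transfer" from the abstract can be sketched as follows. This is not the authors' code. It assumes that GSRD in the caption refers to the Gray-Scott reaction-diffusion model, uses a 1-D periodic grid with unit spacing and explicit Euler steps, and the function name gs_run and all parameter names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Advance a 1-D Gray-Scott system by `steps` explicit Euler steps.
 * u and v are updated in place; returns 0 on success, -1 on allocation failure.
 * Assumptions: periodic boundaries, grid spacing 1. */
int gs_run(int n, double *u, double *v, int steps,
           double Du, double Dv, double F, double k, double dt)
{
    assert(n > 0);
    double *un = malloc((size_t)n * sizeof *un);
    double *vn = malloc((size_t)n * sizeof *vn);
    if (un == NULL || vn == NULL) { free(un); free(vn); return -1; }

    /* One data region around the whole time loop: u and v cross the
     * host/device boundary once each way, scratch arrays never do. */
    #pragma acc data copy(u[0:n], v[0:n]) create(un[0:n], vn[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i) {
                int l = (i + n - 1) % n;   /* left neighbour, periodic */
                int r = (i + 1) % n;       /* right neighbour, periodic */
                double uvv = u[i] * v[i] * v[i];
                un[i] = u[i] + dt * (Du * (u[l] + u[r] - 2.0 * u[i]) - uvv + F * (1.0 - u[i]));
                vn[i] = v[i] + dt * (Dv * (v[l] + v[r] - 2.0 * v[i]) + uvv - (F + k) * v[i]);
            }
            /* Copy the new state back on the device for the next step. */
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i) {
                u[i] = un[i];
                v[i] = vn[i];
            }
        }
    }

    free(un);
    free(vn);
    return 0;
}
```

The uniform state u = 1, v = 0 is a fixed point of these equations, which gives a quick sanity check before trying pattern-forming parameters.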

RESOURCES
Paper: OpenACC Parallelization of Stochastic
Simulations on GPUs
Pilsung Kang
We present an OpenACC-based parallelization implementation of stochastic
algorithms for simulating biochemical reaction networks on modern GPUs
(graphics processing units). To investigate the effectiveness of using OpenACC
for leveraging the massive hardware parallelism of the GPU architecture, we
carefully apply OpenACC’s language constructs and mechanisms to
implementing a parallel version of stochastic simulation algorithms on the GPU.
Using our OpenACC implementation in comparison to both the NVIDIA CUDA
and the CPU-based implementations, we report our initial experiences on
OpenACC’s performance and programming productivity in the context of GPU-
accelerated scientific computing.
Fig. 1. OpenACC programming model for heterogeneous systems

RESOURCES
Paper: Refactoring and Optimizing WRF Model on
Sunway TaihuLight
Kai Xu, Zhenya Song, Yuandong Chan, Shida Wang,
Xiangxu Meng, Weiguo Liu, and Wei Xue
The Weather Research and Forecasting (WRF) Model is one of the most widely used
mesoscale numerical weather prediction systems and is designed for both atmospheric
research and operational forecasting applications. However, it is an extremely
time-consuming application: running a single simulation takes researchers days to weeks as
the simulation size scales up and computing demands grow. In this paper, we port and
optimize the whole WRF model to the Sunway TaihuLight supercomputer at a large
scale. For the dynamic core in WRF, we present a domain-specific tool, namely,
SWSLL, which is a directive-based compiler tool for the Sunway many-core architecture
to convert the stencil computation into optimized parallel code. We also apply a
decomposition strategy for SWSLL to improve the memory locality and decrease the
number of off-chip memory accesses. For physical parameterizations, we explore the
thread-level parallelization using OpenACC directives via reorganizations of data
layouts and loops to achieve high performance. We present the algorithms and
implementations and demonstrate the optimizations of a complicated real-world
atmospheric model on the Sunway TaihuLight supercomputer. Evaluation results
reveal that for the widely used benchmark with a horizontal resolution of 2.5 km, a
speedup of 4.7 can be achieved by using the proposed algorithm and optimization
strategies for the whole WRF model. In terms of strong scalability, our implementation
scales well to hundreds of thousands of heterogeneous cores on Sunway TaihuLight.
Fig. 6. Typical 3-D grid computation, which is decomposed into many chunks
based on k or j dimension. Each color in the figure represents one chunk.
Each chunk is further partitioned into multiple blocks based on i dimensions.
Each block is assigned to a CPE core for computation.

RESOURCES
Online Video Course: OpenACC Programming
Presented by Appentra
Explore Appentra’s new online video course. Presented as
a series and broken into simple, easy-to-follow steps, these
videos succinctly explain how to quickly get up and running
on OpenACC code parallelization.
Learn best practices for parallel programming using
OpenACC, how to decompose codes into parallel patterns,
and a practical step-by-step process based on patterns for
parallelizing any code.

STAY IN THE KNOW:
JOIN THE OPENACC COMMUNITY
The OpenACC specification is designed for, and
by, users, meaning that the OpenACC organization
relies on our users’ active participation to shape
the specification and to educate the scientific
community on its use.
Take an active role in influencing the future of both
the OpenACC specification and the organization
itself by becoming a member of the community.
