Brief summary of byteLAKE’s experience with NVIDIA architectures.
This is an appendix to the one-slide overview of the same topic.
pl. Solny 14/3
50-062 Wroclaw, Poland
+48 508 091 885
+48 505 322 282
+1 650 735 2063
byteLAKE on NVIDIA: EXPERIENCE Jun-18
• parallelization of the EULAG model (used, e.g., for weather simulations)
• porting of various applications and algorithms to HPC (CPU + GPU) architectures.
About EULAG: the model has a proven record of successful applications, with excellent efficiency and scalability on conventional supercomputer architectures. For instance, it is being implemented as the new dynamical core of the COSMO weather prediction framework.
Expertise in CUDA, OpenCL, OpenACC and NVIDIA hardware
(from supercomputers to small embedded devices).
• NVIDIA architectures such as Kepler (e.g. K80 for servers, GeForce GTX Titan for desktops, Jetson for mobile), Maxwell (e.g. GeForce GTX 980 for desktops), and Pascal (e.g. P100 for servers).
• We have been working on NVIDIA platforms since the Tesla architecture (the C1060 card, 2008) and the Fermi architecture (the C2050 card).
• We have also started running technical courses on the Volta architecture (e.g. Tesla V100).
More about HPC weather simulations:
• we have done substantial work analyzing the algorithm’s overall resource usage and its influence on system performance.
• based on that, we removed bottlenecks and eventually developed a method of efficient
distribution of computation across GPU kernels.
• our method analyzes memory transactions between the GPU’s global and shared memories. That lets us deploy various strategies to accelerate code execution: stencil decomposition, block decomposition (weighing computation against communication), reduction of inter-memory communication, and register-file reuse.
• besides, we applied additional optimization techniques, including 2.5D blocking, coalesced memory access, padding, and maintaining high GPU occupancy, as well as algorithm-specific optimizations such as rearranging boundary conditions (to reduce branch divergence) and managing the exchange of halo areas between graphics processors within a single node.
• all of these helped us significantly improve the overall performance of the simulation.
• on top of these, we built a machine-learning-based auto-tuning procedure that automates the adaptation of the simulation to a given set of GPUs, taking their individual characteristics into account. The tuned algorithm/GPU-specific parameters include the CUDA block size for each kernel of the algorithm, the data-alignment boundary for each of the algorithm’s arrays, the GPU shared-memory configuration, cached versus non-cached memory access, and the CUDA compute capability setting.
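Among the techniques listed above, 2.5D blocking is worth a small sketch. The snippet below uses Python/NumPy for readability (the production code is CUDA C); the function name and array sizes are illustrative. The idea is to tile the x-y plane and stream through z, keeping only three consecutive planes live at a time, which is what lets the working set fit into GPU shared memory and registers:

```python
import numpy as np

def laplace3d_25d(u):
    """7-point stencil sweep with 2.5D blocking: stream through z while
    keeping only three consecutive x-y planes live (on a GPU these
    planes would reside in shared memory and registers)."""
    nz, ny, nx = u.shape
    out = np.zeros_like(u)
    below, center = u[0], u[1]
    for k in range(1, nz - 1):
        above = u[k + 1]
        # All six neighbors of each interior point in plane k.
        out[k, 1:-1, 1:-1] = (
            center[1:-1, :-2] + center[1:-1, 2:]
            + center[:-2, 1:-1] + center[2:, 1:-1]
            + below[1:-1, 1:-1] + above[1:-1, 1:-1]
            - 6.0 * center[1:-1, 1:-1]
        )
        # Rotate the three live planes instead of re-reading them.
        below, center = center, above
    return out
```

On a constant field the Laplacian vanishes, which makes a quick sanity check easy: `laplace3d_25d(np.ones((5, 6, 7)))` is all zeros.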
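The auto-tuning procedure can likewise be sketched as a search over kernel launch parameters. This is a minimal illustration assuming an exhaustive search; the parameter names, their ranges, and the `fake_benchmark` cost model are hypothetical stand-ins for real kernel timing runs:

```python
import itertools

# Hypothetical search space mirroring the tuned parameters described
# above: CUDA block dimensions, padding, shared-memory usage.
SEARCH_SPACE = {
    "block_x": [16, 32, 64],
    "block_y": [2, 4, 8],
    "padding": [0, 16, 32],
    "use_shared_mem": [True, False],
}

def autotune(benchmark, space):
    """Return the configuration with the lowest measured runtime.

    `benchmark` maps a configuration dict to a runtime; in the real
    system it would launch and time the CUDA kernel."""
    best_cfg, best_time = None, float("inf")
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        t = benchmark(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

def fake_benchmark(cfg):
    # Toy cost model standing in for a real timing run.
    penalty = abs(cfg["block_x"] - 32) + abs(cfg["block_y"] - 4)
    penalty += 0 if cfg["use_shared_mem"] else 10
    penalty += cfg["padding"] // 16
    return 1.0 + penalty

best, t = autotune(fake_benchmark, SEARCH_SPACE)
print(best, t)
```

The production tuner replaces the exhaustive loop with a machine-learning-guided search, but the contract is the same: measure candidate configurations, keep the fastest.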
Results of the HPC weather simulation improvements:
• We have experimentally validated our methods on NVIDIA Kepler-based GPUs (incl. Tesla K20X, GeForce GTX TITAN, a single Tesla K80 GPU, and a multi-GPU system with two K80 cards), as well as on a GeForce GTX 980 GPU based on the NVIDIA Maxwell architecture.
• Depending on the grid size and device architecture, our method achieved a speed-up over the basic version of the HPC simulation (without the auto-tuning mechanism) ranging from 1.1x for the GeForce GTX 980 to 1.92x for two Tesla K80 GPUs (side note: the speed-up for the GeForce GTX 980 is comparatively low).
• We then also focused on inter- and intra-node overlapping of data transfers with GPU computations on a GPU-accelerated cluster.
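A back-of-the-envelope model illustrates why this overlapping pays off. The numbers below are illustrative, not measured results:

```python
def sequential_time(n_chunks, t_copy, t_compute):
    # No overlap: each chunk is first copied to the GPU, then computed.
    return n_chunks * (t_copy + t_compute)

def pipelined_time(n_chunks, t_copy, t_compute):
    # Double buffering: while chunk i is being computed, chunk i+1 is
    # copied, so after the first copy the faster stage hides behind
    # the slower one; only the final compute has nothing to overlap.
    return t_copy + (n_chunks - 1) * max(t_copy, t_compute) + t_compute

print(sequential_time(4, 2.0, 3.0))  # 20.0
print(pipelined_time(4, 2.0, 3.0))   # 14.0
```

With four chunks, a 2 ms copy and a 3 ms compute per chunk, the sequential schedule takes 20 ms while the double-buffered one takes 14 ms, because all copies except the first hide behind computation.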
• For the Piz Daint cluster (then equipped with NVIDIA Tesla K20 GPUs; 2015), our approach achieved weak scalability up to 136 nodes, with performance exceeding 16 TFlop/s in double precision. All in all, our improved code was almost twice as fast as the basic one. Beyond performance, we also reduced energy consumption: we applied mixed-precision arithmetic to the algorithm and managed it dynamically using a modified version of the random forest (machine learning) algorithm. We deployed it on the Piz Daint supercomputer (ranked 3rd on the TOP500 list as of Nov. 2017), which is equipped with NVIDIA Tesla P100 GPU accelerators based on the NVIDIA Pascal architecture.
• We also deployed it on the MICLAB cluster containing NVIDIA Tesla K80 (Kepler-based) GPUs. As a result, we reduced energy consumption by up to 36%.
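The dynamic mixed-precision management can be sketched as follows. The threshold rule below is a hypothetical stand-in for the modified random-forest classifier, and the energy weights are illustrative only, not measurements:

```python
def precision_stub(error_estimate, threshold=1e-3):
    # Stand-in for the classifier: pick single precision when the
    # estimated error allows it, otherwise fall back to double.
    return "fp32" if error_estimate < threshold else "fp64"

# Relative energy cost per step (illustrative numbers).
ENERGY = {"fp32": 0.5, "fp64": 1.0}

def simulated_energy(error_estimates):
    # Total energy for a run whose per-step precision is chosen
    # dynamically from per-step error estimates.
    return sum(ENERGY[precision_stub(e)] for e in error_estimates)

steps = [1e-4, 5e-4, 2e-3, 1e-4, 4e-3, 1e-5]
print(simulated_energy(steps))  # 4.0 (vs. 6.0 for all-fp64)
```

The point of the dynamic approach is exactly this trade: steps whose estimated error tolerates it run in cheaper single precision, while accuracy-critical steps keep double precision.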
Example research publications using NVIDIA hardware:
• Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU
accelerators, Parallel Computing 40(8), 2014, 425-447
• Adaptation of fluid model EULAG to graphics processing unit architecture, Concurrency and
Computation: Practice and Experience 27(4), 2015, 937-957
• Performance modeling of 3D MPDATA simulations on GPU cluster, Journal of
Supercomputing 73(2), 2017, 664-675
• Systematic adaptation of stencil-based 3D MPDATA algorithm to GPU architectures,
Concurrency and Computation: Practice and Experience 29(9), 2017
• Machine learning method for energy reduction by utilizing dynamic mixed precision on GPU-
based supercomputers, Concurrency and Computation: Practice and Experience
Let’s stay in touch: welcome@byteLAKE.com
Learn how we work:
• We start with a consultancy session to better understand our client’s requirements & needs.
• We thoroughly analyze the gathered information and prepare a draft offer.
• We fine-tune the offer further and wrap up everything into a contract.
• Finally, the execution starts. We deliver projects in a fully transparent, Agile (SCRUM-based) manner.
We are specialists in:
• Artificial Intelligence: we build AI software and integrate it into products, and we deploy AI in data centers, in the cloud, and on constrained, embedded devices (AI on Edge).
• High Performance Computing: we port and optimize algorithms for parallel, CPU+GPU HPC architectures.

Helping companies transform for the era of Artificial Intelligence: we are a team of scientists, programmers, designers and technology enthusiasts helping industries incorporate AI techniques into products.