SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
Utilizing AMD GPUs: Tuning, programming models, and
roadmap
FOSDEM’22 HPC, Big Data and Data Science devroom
February 6th, 2022
George S. Markomanolis
Lead HPC Scientist, CSC – IT Center For Science Ltd.
LUMI
2
AMD GPUs (MI100 example)
3
AMD MI100
Introduction to HIP
• Radeon Open Compute Platform (ROCm)
• HIP: Heterogeneous Interface for Portability is developed by AMD to program on AMD GPUs
• It is a C++ runtime API and it supports both AMD and NVIDIA platforms
• HIP is similar to CUDA and there is no performance overhead on NVIDIA GPUs
• Many well-known libraries have been ported on HIP
• New projects or porting from CUDA, could be developed directly in HIP
• The supported CUDA API is called with HIP prefix (cudamalloc -> hipmalloc)
https://github.com/ROCm-Developer-Tools/HIP
4
Benchmark MatMul cuBLAS, hipBLAS
• Use the benchmark https://github.com/pc2/OMP-Offloading
• Matrix multiplication of 2048 x 2048, single precision
• All the CUDA calls were converted and it was linked with hipBlas
5
0
5000
10000
15000
20000
25000
V100 MI100
GFLOP/s
GPU
Matrix Multiplication (SP)
N-BODY SIMULATION
• N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA) AllPairs_N2
• 171 CUDA calls converted to HIP without issues, close to 1000 lines of code
• 32768 number of small particles, 2000 time steps
• Tune the number of threads equal to 256 than 1024 default at ROCm 4.1
6
0
20
40
60
80
100
120
V100 MI100 MI100*
Seconds
GPU
N-Body Simulation
BabelStream
7
• A memory bound benchmark from the university of Bristol
• Five kernels
o add (a[i]=b[i]+c[i])
o multiply (a[i]=b*c[i])
o copy (a[i]=b[i])
o triad (a[i]=b[i]+d*c[i])
o dot (sum = sum+d*c[i])
Improving OpenMP performance on BabelStream for MI100
• Original call:
#pragma omp target teams distribute parallel for simd
• Optimized call
#pragma omp target teams distribute parallel for simd thread_limit(256) num_teams(240)
• For the dot kernel we used 720 teams
8
Mixbench
• The purpose of this benchmark tool is to evaluate performance bounds of GPUs on
mixed operational intensity kernels.
• The executed kernel is customized on a range of different operational intensity values.
• Supported programming models: CUDA, HIP, OpenCL and SYCL
• We use three types of experiments combined with global memory accesses:
o Single precision Flops (multiply-additions)
o Double precision Flops (multiply-additions)
o Half precision Flops (multiply-additions)
• Following results present peak performance
• Source: https://github.com/ekondis/mixbench
9
Mixbench
10
Mixbench
11
Mixbench
12
Programming Models
• We have utilized with success at least the following programming models/interfaces
on AMD MI100 GPU:
oHIP
oOpenMP Offloading
ohipSYCL
oKokkos
oAlpaka
13
SYCL (hipSYCL)
• C++ Single-source Heterogeneous Programming for Acceleration Offload
• Generic programming with templates and lambda functions
• Big momentum currently, NERSC, ALCF, Codeplay partnership
• SYCL 2020 specification was announced early 2021
• Terminology: Unified Shared Memory (USM), buffer, accessor, data movement, queue
• hipSYCL supports CPU, AMD/NVIDIA GPUs, Intel GPU (experimental)
• https://github.com/illuhad/hipSYCL
14
Kokkos
• Kokkos Core implements a programming model in C++ for writing performance portable applications targeting
all major HPC platforms. It provides abstractions for both parallel execution of code and data management.
(ECP/NNSA)
• Terminology: view, execution space (serial, threads, OpenMP, GPU,…), memory space (DRAM, NVRAM,
…), pattern, policy
• Supports: CPU, AMD/NVIDIA GPUs, Intel KNL etc.
• https://github.com/kokkos
15
Alpaka
• Abstraction Library for Parallel Kernel Acceleration (Alpaka) library is a header-only
C++14 abstraction library for accelerator development. Developed by HZDR.
• Similar to CUDA terminology, grid/block/thread plus element
• Platform decided at the compile time, single source interface
• Easy to port CUDA codes through CUPLA
• Terminology: queue (non/blocking), buffers, work division
• Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU) etc.
• https://github.com/alpaka-group/alpaka
16
BabelStream Results
17
AMD Instinct MI250X
• Two graphics compute dies (GCDs)
• 64GB of HBM2e memory per GCD (total 128GB)
• 26.5 TFLOPS peak performance per GCD
• 1.6 TB/s memory bandwidth per GCD
• 110 CU per GCD, totally 220 CU per GPU
• Both GCDs are interconnected with 200 GB/s per direction
• The interconnection is attached on the GPU (not on the CPU)
18
MI250X
19
Using MI250X
• Utilize CRAY MPICH with GPU Support (export MPICH_GPU_SUPPORT_ENABLED=1)
• Use 1 MPI process per GCD, so 2 MPI processes per GPU and 8 MPI processes per node, if you plan to
utilize 4 GPUs
• MI250x can have multiple contexts sharing in the same GPU , thus supports many MPI processes per
GPU by default
• Be careful with contention as multiple contexts share resources
• If the applications requires it, use different number of MPI processes
20
OpenACC
• GCC will provide OpenACC (Mentor Graphics contract, now called Siemens EDA).
Checking functionality
• HPE is supporting OpenACC v2.6 for Fortran. This is quite old OpenACC version.
HPE announced that they will not support OpenACC for C/C++
• Clacc from ORNL: https://github.com/llvm-doe-org/llvm-project/tree/clacc/master
OpenACC from LLVM only for C (Fortran and C++ in the future)
oTranslate OpenACC to OpenMP Offloading
• If the code is in Fortran, we could use GPUFort
21
Clacc
$ clang -fopenacc-print=omp -fopenacc-structured-ref-count-omp=no-hold -fopenacc-
present-omp=no-present jacobi.c
Original code:
#pragma acc parallel loop reduction(max:lnorm) private(i,j) 
present(newarr, oldarr) collapse(2)
for (i = 1; i < nx + 1; i++) {
for (j = 1; j < ny + 1; j++) {
New code:
#pragma omp target teams map(alloc: newarr,oldarr) map(tofrom: lnorm)
shared(newarr,oldarr) firstprivate(nx,ny,factor) reduction(max: lnorm) 
#pragma omp distribute private(i,j) collapse(2)
for (i = 1; i < nx + 1; i++) {
for (j = 1; j < ny + 1; j++) {
22
Results of BabelStream on NVIDIA V100
23
24
GPUFort – Fortran with OpenACC (1/2)
25
Ifdef original file
GPUFort – Fortran with OpenACC (2/2)
26
Extern C
routine
Kernel
Porting diagram and Software Roadmap
27
Tuning
• Multiple wavefronts per compute unit (CU) is important to hide latency and instruction throughput
• Tune number of threads per block, number of teams for OpenMP offloading and other programming models
• Memory coalescing increases bandwidth
• Unrolling loops allow compiler to prefetch data
• Small kernels can cause latency overhead, adjust the workload
• Use of Local Data Share (LDS) memory
• Profiling, this could be a bit difficult without proper tools
28
Conclusion/Future work
• A code written in C/C++ and MPI+OpenMP is a bit easier to be ported to OpenMP offloading compared to other
approaches.
• The hipSYCL, Kokos, and Alpaka could be a good option considering that the code is in C++.
• There can be challenges, depending on the code and what GPU functionalities are integrated to an application
• It will be required to tune the code for high occupancy
• Track historical performance among new compilers
• GCC for OpenACC and OpenMP Offloading for AMD GPUs (issues will be solved with GCC 12.x and LLVM
13.x)
• Tracking how profiling tools work on AMD GPUs (rocprof, TAU, Score-P, HPCToolkit)
• Paper “Evaluating GPU programming models for the LUMI Supercomputer” will be presented at Supercomputing
Asia 2022
29
www.lumi-supercomputer.eu
contact@lumi-supercomputer.eu
Follow us
Twitter: @LUMIhpc
LinkedIn: LUMI supercomputer
YouTube: LUMI supercomputer
georgios.markomanolis@csc.fi
CSC – IT Center for Science Ltd.
Lead HPC Scientist
George Markomanolis

Contenu connexe

Tendances

03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final
03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final
03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_FinalGopi Krishnamurthy
 
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph Ceph Community
 
Delivering the Future of High-Performance Computing
Delivering the Future of High-Performance ComputingDelivering the Future of High-Performance Computing
Delivering the Future of High-Performance ComputingAMD
 
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV Cluster
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV ClusterMethod of NUMA-Aware Resource Management for Kubernetes 5G NFV Cluster
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV Clusterbyonggon chun
 
If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureAllan Cantle
 
LCA13: Power State Coordination Interface
LCA13: Power State Coordination InterfaceLCA13: Power State Coordination Interface
LCA13: Power State Coordination InterfaceLinaro
 
Arm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing
Arm DynamIQ: Intelligent Solutions Using Cluster Based MultiprocessingArm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing
Arm DynamIQ: Intelligent Solutions Using Cluster Based MultiprocessingArm
 
DDR, GDDR, HBM Memory : Presentation
DDR, GDDR, HBM Memory : PresentationDDR, GDDR, HBM Memory : Presentation
DDR, GDDR, HBM Memory : PresentationSubhajit Sahu
 
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor CoreAMD
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesAMD
 
Heterogeneous Integration with 3D Packaging
Heterogeneous Integration with 3D PackagingHeterogeneous Integration with 3D Packaging
Heterogeneous Integration with 3D PackagingAMD
 
CE-4028, Miracast with AMD Wireless Display technology – Kickass gaming and o...
CE-4028, Miracast with AMD Wireless Display technology – Kickass gaming and o...CE-4028, Miracast with AMD Wireless Display technology – Kickass gaming and o...
CE-4028, Miracast with AMD Wireless Display technology – Kickass gaming and o...AMD Developer Central
 
Revisit DCA, PCIe TPH and DDIO
Revisit DCA, PCIe TPH and DDIORevisit DCA, PCIe TPH and DDIO
Revisit DCA, PCIe TPH and DDIOHisaki Ohara
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)Akhila Dakshina
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD
 

Tendances (20)

03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final
03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final
03_03_Implementing_PCIe_ATS_in_ARM-based_SoCs_Final
 
Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph Ceph Day Melbourne - Troubleshooting Ceph
Ceph Day Melbourne - Troubleshooting Ceph
 
Delivering the Future of High-Performance Computing
Delivering the Future of High-Performance ComputingDelivering the Future of High-Performance Computing
Delivering the Future of High-Performance Computing
 
Code GPU with CUDA - SIMT
Code GPU with CUDA - SIMTCode GPU with CUDA - SIMT
Code GPU with CUDA - SIMT
 
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV Cluster
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV ClusterMethod of NUMA-Aware Resource Management for Kubernetes 5G NFV Cluster
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV Cluster
 
If AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC ArchitectureIf AMD Adopted OMI in their EPYC Architecture
If AMD Adopted OMI in their EPYC Architecture
 
LCA13: Power State Coordination Interface
LCA13: Power State Coordination InterfaceLCA13: Power State Coordination Interface
LCA13: Power State Coordination Interface
 
Arm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing
Arm DynamIQ: Intelligent Solutions Using Cluster Based MultiprocessingArm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing
Arm DynamIQ: Intelligent Solutions Using Cluster Based Multiprocessing
 
DDR, GDDR, HBM Memory : Presentation
DDR, GDDR, HBM Memory : PresentationDDR, GDDR, HBM Memory : Presentation
DDR, GDDR, HBM Memory : Presentation
 
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip ArchitecturesISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
ISSCC 2018: "Zeppelin": an SoC for Multi-chip Architectures
 
Heterogeneous Integration with 3D Packaging
Heterogeneous Integration with 3D PackagingHeterogeneous Integration with 3D Packaging
Heterogeneous Integration with 3D Packaging
 
CE-4028, Miracast with AMD Wireless Display technology – Kickass gaming and o...
CE-4028, Miracast with AMD Wireless Display technology – Kickass gaming and o...CE-4028, Miracast with AMD Wireless Display technology – Kickass gaming and o...
CE-4028, Miracast with AMD Wireless Display technology – Kickass gaming and o...
 
Revisit DCA, PCIe TPH and DDIO
Revisit DCA, PCIe TPH and DDIORevisit DCA, PCIe TPH and DDIO
Revisit DCA, PCIe TPH and DDIO
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
Introduction to Linux Drivers
Introduction to Linux DriversIntroduction to Linux Drivers
Introduction to Linux Drivers
 
10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop Products
 
Graphics card
Graphics cardGraphics card
Graphics card
 

Similaire à Utilizing AMD GPUs: Tuning, programming models, and roadmap

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerGeorge Markomanolis
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer George Markomanolis
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAFacultad de Informática UCM
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a box01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a boxYutaka Kawai
 
High-Performance Computing with C++
High-Performance Computing with C++High-Performance Computing with C++
High-Performance Computing with C++JetBrains
 
Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Linaro
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementGanesan Narayanasamy
 
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...Edge AI and Vision Alliance
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiectureHaris456
 

Similaire à Utilizing AMD GPUs: Tuning, programming models, and roadmap (20)

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
LEGaTO Integration
LEGaTO IntegrationLEGaTO Integration
LEGaTO Integration
 
01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a box01 high bandwidth acquisitioncomputing compressionall in a box
01 high bandwidth acquisitioncomputing compressionall in a box
 
High-Performance Computing with C++
High-Performance Computing with C++High-Performance Computing with C++
High-Performance Computing with C++
 
B9 cmis
B9 cmisB9 cmis
B9 cmis
 
Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509Deep Learning on ARM Platforms - SFO17-509
Deep Learning on ARM Platforms - SFO17-509
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
GPU Programming with Java
GPU Programming with JavaGPU Programming with Java
GPU Programming with Java
 
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 

Plus de George Markomanolis

Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PGeorge Markomanolis
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IGeorge Markomanolis
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIGeorge Markomanolis
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IGeorge Markomanolis
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIGeorge Markomanolis
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerGeorge Markomanolis
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaGeorge Markomanolis
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...George Markomanolis
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIGeorge Markomanolis
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 

Plus de George Markomanolis (15)

Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part I
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part II
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit Supercomputer
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Lustre Best Practices
Lustre Best Practices Lustre Best Practices
Lustre Best Practices
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

Dernier

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Utilizing AMD GPUs: Tuning, programming models, and roadmap

  • 1. Utilizing AMD GPUs: Tuning, programming models, and roadmap FOSDEM’22 HPC, Big Data and Data Science devroom February 6th, 2022 George S. Markomanolis Lead HPC Scientist, CSC – IT Center For Science Ltd.
  • 3. AMD GPUs (MI100 example) 3 AMD MI100
  • 4. Introduction to HIP • Radeon Open Compute Platform (ROCm) • HIP: Heterogeneous Interface for Portability is developed by AMD to program on AMD GPUs • It is a C++ runtime API and it supports both AMD and NVIDIA platforms • HIP is similar to CUDA and there is no performance overhead on NVIDIA GPUs • Many well-known libraries have been ported on HIP • New projects or porting from CUDA, could be developed directly in HIP • The supported CUDA API is called with HIP prefix (cudamalloc -> hipmalloc) https://github.com/ROCm-Developer-Tools/HIP 4
  • 5. Benchmark MatMul cuBLAS, hipBLAS • Use the benchmark https://github.com/pc2/OMP-Offloading • Matrix multiplication of 2048 x 2048, single precision • All the CUDA calls were converted and it was linked with hipBlas 5 0 5000 10000 15000 20000 25000 V100 MI100 GFLOP/s GPU Matrix Multiplication (SP)
  • 6. N-BODY SIMULATION • N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA) AllPairs_N2 • 171 CUDA calls converted to HIP without issues, close to 1000 lines of code • 32768 number of small particles, 2000 time steps • Tune the number of threads equal to 256 than 1024 default at ROCm 4.1 6 0 20 40 60 80 100 120 V100 MI100 MI100* Seconds GPU N-Body Simulation
  • 7. BabelStream 7 • A memory bound benchmark from the university of Bristol • Five kernels o add (a[i]=b[i]+c[i]) o multiply (a[i]=b*c[i]) o copy (a[i]=b[i]) o triad (a[i]=b[i]+d*c[i]) o dot (sum = sum+d*c[i])
  • 8. Improving OpenMP performance on BabelStream for MI100 • Original call: #pragma omp target teams distribute parallel for simd • Optimized call #pragma omp target teams distribute parallel for simd thread_limit(256) num_teams(240) • For the dot kernel we used 720 teams 8
  • 9. Mixbench • The purpose of this benchmark tool is to evaluate performance bounds of GPUs on mixed operational intensity kernels. • The executed kernel is customized on a range of different operational intensity values. • Supported programming models: CUDA, HIP, OpenCL and SYCL • We use three types of experiments combined with global memory accesses: o Single precision Flops (multiply-additions) o Double precision Flops (multiply-additions) o Half precision Flops (multiply-additions) • Following results present peak performance • Source: https://github.com/ekondis/mixbench 9
  • 13. Programming Models • We have utilized with success at least the following programming models/interfaces on AMD MI100 GPU: oHIP oOpenMP Offloading ohipSYCL oKokkos oAlpaka 13
  • 14. SYCL (hipSYCL) • C++ Single-source Heterogeneous Programming for Acceleration Offload • Generic programming with templates and lambda functions • Big momentum currently, NERSC, ALCF, Codeplay partnership • SYCL 2020 specification was announced early 2021 • Terminology: Unified Shared Memory (USM), buffer, accessor, data movement, queue • hipSYCL supports CPU, AMD/NVIDIA GPUs, Intel GPU (experimental) • https://github.com/illuhad/hipSYCL 14
  • 15. Kokkos • Kokkos Core implements a programming model in C++ for writing performance portable applications targeting all major HPC platforms. It provides abstractions for both parallel execution of code and data management. (ECP/NNSA) • Terminology: view, execution space (serial, threads, OpenMP, GPU,…), memory space (DRAM, NVRAM, …), pattern, policy • Supports: CPU, AMD/NVIDIA GPUs, Intel KNL etc. • https://github.com/kokkos 15
  • 16. Alpaka • Abstraction Library for Parallel Kernel Acceleration (Alpaka) library is a header-only C++14 abstraction library for accelerator development. Developed by HZDR. • Similar to CUDA terminology, grid/block/thread plus element • Platform decided at the compile time, single source interface • Easy to port CUDA codes through CUPLA • Terminology: queue (non/blocking), buffers, work division • Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU) etc. • https://github.com/alpaka-group/alpaka 16
  • 18. AMD Instinct MI250X • Two graphics compute dies (GCDs) • 64GB of HBM2e memory per GCD (total 128GB) • 26.5 TFLOPS peak performance per GCD • 1.6 TB/s memory bandwidth per GCD • 110 CU per GCD, totally 220 CU per GPU • Both GCDs are interconnected with 200 GB/s per direction • The interconnection is attached on the GPU (not on the CPU) 18
  • 20. Using MI250X • Utilize CRAY MPICH with GPU Support (export MPICH_GPU_SUPPORT_ENABLED=1) • Use 1 MPI process per GCD, so 2 MPI processes per GPU and 8 MPI processes per node, if you plan to utilize 4 GPUs • MI250x can have multiple contexts sharing in the same GPU , thus supports many MPI processes per GPU by default • Be careful with contention as multiple contexts share resources • If the applications requires it, use different number of MPI processes 20
  • 21. OpenACC • GCC will provide OpenACC (Mentor Graphics contract, now called Siemens EDA). Checking functionality • HPE is supporting OpenACC v2.6 for Fortran. This is quite old OpenACC version. HPE announced that they will not support OpenACC for C/C++ • Clacc from ORNL: https://github.com/llvm-doe-org/llvm-project/tree/clacc/master OpenACC from LLVM only for C (Fortran and C++ in the future) oTranslate OpenACC to OpenMP Offloading • If the code is in Fortran, we could use GPUFort 21
  • 22. Clacc $ clang -fopenacc-print=omp -fopenacc-structured-ref-count-omp=no-hold -fopenacc- present-omp=no-present jacobi.c Original code: #pragma acc parallel loop reduction(max:lnorm) private(i,j) present(newarr, oldarr) collapse(2) for (i = 1; i < nx + 1; i++) { for (j = 1; j < ny + 1; j++) { New code: #pragma omp target teams map(alloc: newarr,oldarr) map(tofrom: lnorm) shared(newarr,oldarr) firstprivate(nx,ny,factor) reduction(max: lnorm) #pragma omp distribute private(i,j) collapse(2) for (i = 1; i < nx + 1; i++) { for (j = 1; j < ny + 1; j++) { 22
  • 23. Results of BabelStream on NVIDIA V100 23
  • 24. 24
  • 25. GPUFort – Fortran with OpenACC (1/2) 25 Ifdef original file
  • 26. GPUFort – Fortran with OpenACC (2/2) 26 Extern C routine Kernel
  • 27. Porting diagram and Software Roadmap 27
  • 28. Tuning • Multiple wavefronts per compute unit (CU) is important to hide latency and instruction throughput • Tune number of threads per block, number of teams for OpenMP offloading and other programming models • Memory coalescing increases bandwidth • Unrolling loops allow compiler to prefetch data • Small kernels can cause latency overhead, adjust the workload • Use of Local Data Share (LDS) memory • Profiling, this could be a bit difficult without proper tools 28
  • 29. Conclusion/Future work • A code written in C/C++ and MPI+OpenMP is a bit easier to be ported to OpenMP offloading compared to other approaches. • The hipSYCL, Kokos, and Alpaka could be a good option considering that the code is in C++. • There can be challenges, depending on the code and what GPU functionalities are integrated to an application • It will be required to tune the code for high occupancy • Track historical performance among new compilers • GCC for OpenACC and OpenMP Offloading for AMD GPUs (issues will be solved with GCC 12.x and LLVM 13.x) • Tracking how profiling tools work on AMD GPUs (rocprof, TAU, Score-P, HPCToolkit) • Paper “Evaluating GPU programming models for the LUMI Supercomputer” will be presented at Supercomputing Asia 2022 29
  • 30. www.lumi-supercomputer.eu contact@lumi-supercomputer.eu Follow us Twitter: @LUMIhpc LinkedIn: LUMI supercomputer YouTube: LUMI supercomputer georgios.markomanolis@csc.fi CSC – IT Center for Science Ltd. Lead HPC Scientist George Markomanolis