GPU programming made easy
with OpenMP
Aditya Nitsure (anitsure@in.ibm.com)
Pidad D’souza (pidsouza@in.ibm.com)
19/03/20 IBM 2
OpenMP (Open Multi-Processing)
(https://www.openmp.org)
Takeaways
● What is OpenMP?
● How does the OpenMP programming model work?
● How do you write, compile and run an OpenMP program?
● What are the various OpenMP directives, runtime library APIs and environment variables?
● How does an OpenMP program execute on accelerators (GPUs)?
● What are the best practices for writing OpenMP programs?
Introduction (1)
● Governed by the OpenMP Architecture Review Board (ARB)
● Open source, simple and up to date with the latest hardware developments
● A specification for a shared-memory parallel programming model for the Fortran and C/C++ programming languages, consisting of:
 compiler directives
 library routines
 environment variables
● Widely accepted by the community, in academia and industry
Introduction (2)
● Shorter learning curve
● Supported by many compiler vendors
 GNU, LLVM, AMD, IBM, Cray, Intel, PGI, OpenUH, ARM
 Compilers from research labs, e.g. LLNL, Barcelona Supercomputing Centre
● Easy to maintain sequential, CPU-parallel and GPU-parallel versions of code
● Used in application development across a variety of fields and systems
 Aeronautics, automotive, pharmaceutics, finance
 Supercomputers, accelerators, embedded multi-core systems
Specification History
➢ OpenMP 1.0 (1997 Fortran / 1998 C/C++)
● First release of the Fortran and C/C++ specifications (two separate specifications)
➢ OpenMP 2.0 (2000 Fortran / 2002 C/C++)
● Added timing routines and various new clauses: firstprivate, lastprivate, copyprivate, etc.
➢ OpenMP 2.5 (2005)
● Combined standard for C/C++ and Fortran
● Introduced the notion of internal control variables (ICVs) and new constructs: single, sections, etc.
➢ OpenMP 3.0/3.1 (2008–2011)
● Added the concept of tasks to the OpenMP execution model
➢ OpenMP 4.0 (2013)
● Accelerator (GPU) support
● SIMD construct for vectorization of loops
● Support for Fortran 2003
➢ OpenMP 4.5 (2015)
● Significant improvements for devices (GPU programming) and Fortran 2003
● New taskloop construct
➢ OpenMP 5.0 (2018) – compilers do not yet support the full feature set of this standard!
● Extended the memory model to distinguish different types of flush operations
● Added the target-offload-var ICV and the OMP_TARGET_OFFLOAD environment variable
● Extended the teams construct to execute on the host device
OpenMP Programming (1)
● The OpenMP API uses the fork-join model of parallel execution.
● The OpenMP API provides a relaxed-consistency, shared-memory model.
● An OpenMP program begins as a single thread of execution, called the initial thread. The initial thread executes sequentially.
● When any thread encounters a parallel construct, the thread creates a team consisting of itself and zero or more additional threads, and becomes the master of the new team.
OpenMP Programming (2)
● Compiler directives
– Specified with pragma preprocessing directives
– Start with #pragma omp (for C/C++)
– Case sensitive
– A directive may be followed by one or more clauses
– Syntax: #pragma omp directive-name [clause[ [,] clause] ... ] new-line
● Runtime library routines
– The library routines are external functions with "C" linkage
– omp.h provides prototype definitions for all library routines
● Environment variables
– Set at program start-up
– Specify the settings of the internal control variables (ICVs) that affect the execution of OpenMP programs
– Always in upper case
OpenMP Hello World (helloWorld.c)
#include <stdio.h>
#include <omp.h>

int main()
{
    int numThreads = 0;
    // call to OpenMP runtime library (returns 1 outside a parallel region)
    numThreads = omp_get_num_threads();
    // OpenMP directive
    #pragma omp parallel
    {
        // call to OpenMP runtime library
        int threadNum = omp_get_thread_num();
        printf("Hello World from thread %d\n", threadNum);
    }
    return 0;
}

// Setting OpenMP environment variable
export OMP_NUM_THREADS=4
// Compile helloWorld program
xlc -qsmp=omp helloWorld.c -o helloWorld
// Run helloWorld program
./helloWorld

Output
[aditya@hpcwsw7 openmp_tutorial]$ ./helloWorld
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
[aditya@hpcwsw7 openmp_tutorial]$ ./helloWorld
Hello World from thread 0
Hello World from thread 1
Hello World from thread 2
Hello World from thread 3
Compilation of OpenMP programs
XL
● CPU: -qsmp=omp
● GPU: -qsmp=omp -qoffload -qtgtarch=sm_70 (V100 GPU)
GNU
● CPU: -fopenmp
● GPU: -fopenmp -foffload=nvptx-none
NVIDIA CUDA linking: -lcudart -L/usr/local/cuda/lib64
OpenMP Directives
● Parallel construct
● SIMD construct
● Combined construct
● Work-sharing construct
● Master and synchronization construct
● Tasking construct
● Device construct
Parallel construct
#pragma omp parallel [clause[ [,] clause] ... ] new-line
{
}
● The fundamental construct that starts parallel execution
● A team of threads is created to execute the parallel region
● The original thread becomes the master of the new team
● All threads in the team execute the parallel region

#pragma omp parallel
{
    int threadNum = omp_get_thread_num();
    printf("Hello World from thread %d\n", threadNum);
}
(Figure: the original thread becomes the master thread in the parallel region; there is an implicit barrier at the end of the parallel region.)
SIMD construct
● Applied to a loop to transform its iterations to execute concurrently using SIMD instructions
● When any thread encounters a simd construct, the iterations of the loop associated with the construct may be executed concurrently using the SIMD lanes that are available to the thread.

#pragma omp simd [clause[ [,] clause] ... ] new-line
{
}

#pragma omp parallel for simd
for (i=0; i<N; i++)
{
    a[i] = b[i] * c[i];
}
(Figure: each thread's iterations execute concurrently on the vector unit, e.g. b0..b3 multiplied element-wise by c0..c3; implicit barrier at the end of the parallel region.)
Combined construct
● A combination of more than one construct
– Specifies one construct immediately nested inside another construct
● Clauses from both constructs are permitted
– With some exceptions, e.g. the nowait clause cannot be specified on parallel for or parallel sections
● Examples
– #pragma omp parallel for
– #pragma omp parallel for simd
– #pragma omp parallel sections
– #pragma omp target parallel for simd
Worksharing construct
● Distributes the execution of the associated region among the members of the team
● Implicit barrier at the end
● Types
– Loop construct
– Sections construct
– Single construct
– Workshare construct (Fortran only)
Worksharing : Loop construct
● Iterations of the for-loop are distributed among the threads
● One or more iterations are executed in parallel
● Implicit barrier at the end unless nowait is specified

#pragma omp for [clause[ [,] clause] ... ] new-line
// for-loop

int N = 200;
#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<N; i++)
    {
        a[i] = b[i] * c[i];
    }
}
(Figure: the iteration space is divided among the threads, e.g. iterations 0-39, 40-79, 80-119, 120-159 and 160-199; implicit barrier at the end of the for-loop.)
Worksharing : Sections construct
● Non-iterative worksharing construct
● Each section's structured block is executed once
● The section structured blocks are executed in parallel (one thread per section)
● Implicit barrier at the end unless nowait is specified

#pragma omp sections [clause[ [,] clause] ... ] new-line
{
    #pragma omp section new-line
    // structured block
    #pragma omp section new-line
    // structured block
}

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            // do_work1
        }
        #pragma omp section
        {
            // do_work2
        }
    }
}
(Figure: sections are divided among the threads, one thread per section; implicit barrier at the end of the sections construct.)
Worksharing : Single construct
● Only one thread in the team executes the associated structured block
● Implicit barrier at the end
● All other threads wait until the single block has finished
● The nowait clause removes the barrier at the end of the single construct

#pragma omp single [clause[ [,] clause] ... ] new-line
{
}

#pragma omp parallel
{
    // do_some_work()
    #pragma omp single
    {
        // write result to file
    }
    // do_post_work()
}
(Figure: only one thread executes the single section; barriers at the end of the single section and at the end of the parallel region.)
Tasking construct
● Defines an explicit task
● A task and its corresponding data environment are generated from the associated structured block
● Execution of a task may be immediate or deferred for later execution
● Examples: while-loops, recursion, sorting algorithms

#pragma omp task [clause[ [,] clause] ... ] new-line
{
}

#pragma omp parallel
{
    #pragma omp single
    {
        while (ptr != NULL)
        {
            #pragma omp task
            {
                // do_some_work();
            }
            // advance ptr before the next iteration
        } // end of while
    } // end of single
} // end of parallel
(Figure: a single thread creates the tasks; the task scheduler schedules tasks onto the available threads; barriers at the end of the single construct and of the parallel region.)
Synchronization construct (1)
Critical construct
#pragma omp parallel shared(sum) private(localSum)
{
    localSum = 0; // private copies are not initialized by default
    #pragma omp for
    for (i=0; i<N; i++)
    {
        localSum += a[i];
    }
    #pragma omp critical (global_sum)
    {
        sum += localSum;
    }
}
(Only one thread at a time enters the critical section; the threads execute it sequentially. Implicit barriers at the end of the for-loop and the parallel region.)

Master construct
#pragma omp parallel
{
    #pragma omp for
    // do_pre_work()
    #pragma omp master
    {
        // write result to file
    }
    #pragma omp for
    // do_post_work()
}
(Only the master thread executes the master section; there is no entry barrier and no exit barrier. Implicit barriers at the end of each for-loop and the parallel region.)

Ordered construct
#pragma omp parallel
{
    #pragma omp for ordered
    for (i=0; i<N; i++)
    {
        // do_some_work
        #pragma omp ordered
        {
            sum += a[i];
        }
        // do_post_work
    }
}
(The ordered region preserves the iteration order. Implicit barriers at the end of the for-loop and the parallel region.)
Synchronization construct (2)
● Barrier construct
– #pragma omp barrier new-line
– Stand-alone directive
– Must be executed by all threads in the parallel region
– No thread continues until all threads reach the barrier
● Taskwait construct
– #pragma omp taskwait new-line
– Stand-alone directive
– Waits until all child tasks generated before the taskwait region complete execution
● Atomic construct
– #pragma omp atomic [atomic-clause] new-line
– Ensures that a specific storage location is accessed atomically
– Enforces atomicity for read, write, update or capture
● Flush construct
– #pragma omp flush [(list)] new-line
– Stand-alone directive
– Makes a thread's temporary view of memory consistent with main memory
– When a list is specified, the flush operation applies to the items in the list; otherwise it applies to all thread-visible data items
GPU offloading using OpenMP
Device construct (1)
CPU (no threading):
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}

CPU OpenMP (implicit barrier at the end of the parallel for):
#pragma omp parallel for
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}

GPU OpenMP:
#pragma omp target
#pragma omp parallel for
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}

Target construct
● Executes the construct on a device (GPU)
● Maps variables to the device data environment
Note
● No thread migration from one device to another.
● In the absence of a device, or if the implementation does not support the target device, all target regions execute on the host.
Device construct (2)
The host-centric execution model:
1. A thread starts on the host, and the host creates the data environments on the device(s).
2. The host maps data to the device data environment (data moves to the device).
3. The host offloads target regions to the target devices (code executes on the device).
4. After execution, the host updates the data between the host and the device (transfers data from device to host).
5. The host destroys the data environment on the device.

#pragma omp target
#pragma omp parallel for
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}
(Figure: teams execute asynchronously; threads within a team synchronise. A CUDA block corresponds to an OpenMP team.)
Device construct (3)
Teams construct
● Creates a league of thread teams
● The master thread of each team executes the region
● No implicit barrier at the end of the teams construct

#pragma omp target teams
#pragma omp parallel for
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}

Distribute construct
● Distributes iterations among the master threads of all teams
● No implicit barrier at the end of the distribute construct
● Always associated with loops and binds to the teams region

#pragma omp target
#pragma omp teams distribute parallel for
for (i=0; i<N; i++)
{
    a[i] = b[i] + s * c[i];
}
Device construct (4)
● Target data
– #pragma omp target data [clause] new-line { structured block }
– Maps variables to the device data environment; the task encountering the construct executes the region
● Target enter data / target exit data
– #pragma omp target enter data [clause] new-line
– #pragma omp target exit data [clause] new-line
– Stand-alone directives
– Variables are mapped/unmapped to/from the device data environment
– Generates a new task that encloses the target data region
Device Data Mapping
● Map clause
– Maps variables from the host data environment to corresponding variables in the device data environment
– map([ [map-type-modifier[,]] map-type : ] list)
– map-type
● to: initializes the device variables from the corresponding host variables
● from: writes data back from the device to the host
● tofrom: the default map type (combination of to and from)
● alloc: allocates device memory for the variables
● release: decrements the reference count by one
● delete: deletes the device memory of the variables
Device construct (5)
● if clause
– if (directive-name-modifier : scalar-expression)
– Used on target data, target enter/exit data and combined constructs
– On target constructs, if the if clause expression evaluates to false, the target region executes on the host
● device (n)
– Specifies the device (GPU) on which the target region executes; 'n' is the device id
– In the absence of a device clause, the target region executes on the default device
OpenMP Clauses (1)
● schedule([modifier :] kind[, chunk_size])
– Kind
● static: iterations are divided into chunks and assigned to threads in round-robin fashion
● dynamic: iterations are divided into chunks and assigned to threads dynamically (first-come, first-served)
● guided: threads grab chunks of iterations, and the chunk size gradually decreases to chunk_size as execution proceeds
● auto: the compiler or OpenMP runtime decides the optimum schedule from static, dynamic or guided
● runtime: the decision is deferred until runtime; the schedule type and chunk size are taken from the environment variable
– Modifier (new in the OpenMP 4.5 specification)
● simd: the chunk size is adjusted to the SIMD width; works only with the SIMD construct, e.g. schedule(simd:static, 5)
● monotonic: chunks are assigned to each thread in increasing logical iteration order
● nonmonotonic: the order in which chunks are assigned to threads is unspecified
OpenMP Clauses (2)
● collapse (n)
– Collapses two or more loops to create a larger iteration space
– The parameter n specifies the number of loops associated with the loop construct to be collapsed

#pragma omp parallel for collapse (2)
for (i=0; i<N; i++) {
    for (j=0; j<M; j++) {
        for (k=0; k<M; k++) {
            foo(i, j, k);
        }}}

● nowait
– Allows asynchronous thread progress in worksharing constructs, i.e. threads that finish early proceed to the next instruction without waiting for the other threads
– Used on loop, sections, single, target and other constructs
Data Sharing Attribute Clauses (1)
● private (list)
– Declares variables to be private to a task or thread
– All references to the original variable are replaced by references to a new variable when the respective directive is encountered
– Private variables are not initialized by default
● shared (list)
– Declares variables to be shared by tasks or threads
– All references to the variable point to the original variable when the respective directive is encountered
● firstprivate (list)
– Initializes the private variable with the value of the original variable prior to execution of the construct
– Used with parallel, task, target, teams, etc.
● lastprivate (list)
– Writes the value of the private variable back to the original variable on exiting the respective construct
– The updated value is that of the sequentially last iteration or section of the worksharing construct
Data Sharing Attribute Clauses (2)
● reduction (reduction-identifier : list)
– Updates the original variable with the final value of each of the private copies, using the combiner operator or identifier
– The reduction identifier is either an identifier (max, min) or an operator (+, -, *, &, |, ^, && and ||)
● threadprivate (list)
– Makes global variables (file scope, static) private to each thread
– Each thread keeps its own copy of the global variable
Data Copying Clauses
Copy the value of a private or threadprivate variable from one thread to another:
● copyin (list)
– Allowed on parallel and combined worksharing constructs
– Copies the value of the master thread's threadprivate variable to the threadprivate variables of the other threads before execution starts
● copyprivate (list)
– Allowed on the single construct
– Broadcasts the value of a private variable from one thread to the other threads in the team
OpenMP Runtime library (1)
● Runtime library definitions
– omp.h provides the prototypes of all OpenMP routines and types
● Execution environment routines (35+ routines)
– Routines that affect and monitor threads, processors and the parallel environment
● void omp_set_num_threads(int num_threads);
● int omp_get_thread_num(void);
● int omp_get_num_procs(void);
● void omp_set_schedule(omp_sched_t kind, int chunk_size);
● Timing routines
– Routines that support a portable wall-clock timer
– double omp_get_wtime(void);
– double omp_get_wtick(void);
OpenMP Runtime library (2)
● Lock routines (6 routines)
– Routines for thread synchronization
– OpenMP locks are represented by OpenMP lock variables
– Two types of locks:
● Simple: can be set only once
● Nestable: can be set multiple times by the same task
– Example APIs
● void omp_init_lock(omp_lock_t *lock); void omp_destroy_lock(omp_lock_t *lock);
● void omp_set_nest_lock(omp_nest_lock_t *lock); void omp_unset_nest_lock(omp_nest_lock_t *lock);
● int omp_test_lock(omp_lock_t *lock);
● Device memory routines (7 routines)
– Routines that support allocation of memory and management of pointers in the data environments of target devices
● void* omp_target_alloc(size_t size, int device_num);
● void omp_target_free(void *device_ptr, int device_num);
● int omp_target_memcpy(void *dst, void *src, size_t length, size_t dst_offset, size_t src_offset, int dst_device_num, int src_device_num);
OpenMP Environment Variables
1. OMP_SCHEDULE sets the run-sched-var ICV that specifies the runtime schedule type and chunk size. It can be set to any of the valid OpenMP schedule types.
2. OMP_NUM_THREADS sets the nthreads-var ICV that specifies the number of threads to use for parallel regions.
3. OMP_DYNAMIC sets the dyn-var ICV that specifies the dynamic adjustment of the number of threads to use for parallel regions.
4. OMP_PROC_BIND sets the bind-var ICV that controls the OpenMP thread affinity policy.
5. OMP_PLACES sets the place-partition-var ICV that defines the OpenMP places that are available to the execution environment.
6. OMP_NESTED sets the nest-var ICV that enables or disables nested parallelism.
7. OMP_STACKSIZE sets the stacksize-var ICV that specifies the size of the stack for threads created by the OpenMP implementation.
8. OMP_WAIT_POLICY sets the wait-policy-var ICV that controls the desired behavior of waiting threads.
9. OMP_MAX_ACTIVE_LEVELS sets the max-active-levels-var ICV that controls the maximum number of nested active parallel regions.
10. OMP_THREAD_LIMIT sets the thread-limit-var ICV that controls the maximum number of threads participating in a contention group.
11. OMP_CANCELLATION sets the cancel-var ICV that enables or disables cancellation.
12. OMP_DISPLAY_ENV instructs the runtime to display the OpenMP version number and the initial values of the ICVs, once, during initialization of the runtime.
13. OMP_DEFAULT_DEVICE sets the default-device-var ICV that controls the default device number.
14. OMP_MAX_TASK_PRIORITY sets the max-task-priority-var ICV that specifies the maximum value that can be specified in the priority clause of the task construct.
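In the same style as the helloWorld compile/run snippet earlier, the variables above are set in the shell before launching the program; they take effect at program start-up without recompiling. The values below are illustrative, and the program name is a placeholder for your own binary.

```shell
export OMP_NUM_THREADS=8           # nthreads-var: team size for parallel regions
export OMP_SCHEDULE="dynamic,4"    # run-sched-var: used by schedule(runtime) loops
export OMP_PROC_BIND=close         # bind-var: place threads near the master
export OMP_PLACES=cores            # place-partition-var: one place per core
export OMP_DISPLAY_ENV=true        # print the ICV values once at startup
# ./helloWorld                     # run your OpenMP program with these settings
echo "$OMP_NUM_THREADS threads, schedule $OMP_SCHEDULE"
```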
Best practices
● Always initialize data (arrays) with parallel code to benefit from the first-touch placement policy on NUMA nodes
● Bind OpenMP threads to cores (see the OpenMP thread affinity policy)
● Watch for false sharing and add padding to arrays if necessary
● Collapse loops to increase the available parallelism
● Always use teams and distribute to expose all available parallelism
● Avoid small parallel blocks on GPUs and unnecessary data transfer between host and device
● Overlap data transfer with compute using asynchronous transfers
Hands-on
Example : matrix multiplication
A × B = C
Matrix multiplication implementation:
for (i=0; i<ROWSIZE; i++)
{
    for (j=0; j<COLSIZE; j++)
    {
        for (k=0; k<ROWSIZE; k++)
        {
            pC(i,j) += pA(i,k) * pB(k,j);
        }
    }
}
Example 1 – GPU offloading
This example demonstrates offloading CPU functionality to a single GPU using OpenMP pragmas. The notebook facilitates compilation, running, monitoring and profiling of the example program.
Example : matrix multiplication (domain decomposition for parallel solution)
(Figure: the implemented domain decomposition splits A, B and C into blocks assigned to compute units CU0-CU3 across nodes N0-N3. CU – compute unit; N – node.)
Example 2 – GPU scaling
This example demonstrates multi-GPU scaling using OpenMP pragmas. The notebook facilitates compilation, running, monitoring and profiling of the example program.
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
 
OpenPOWER Application Optimization
OpenPOWER Application Optimization OpenPOWER Application Optimization
OpenPOWER Application Optimization
 
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo..."Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
 

Plus de Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency programGanesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and VerilogGanesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISAGanesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsGanesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsGanesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsGanesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Ganesan Narayanasamy
 

Plus de Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 
Robustness in deep learning
Robustness in deep learningRobustness in deep learning
Robustness in deep learning
 
Perspectives of Frond end Design
Perspectives of Frond end DesignPerspectives of Frond end Design
Perspectives of Frond end Design
 

Dernier

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Dernier (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Omp tutorial cpugpu_programming_cdac

➢ OpenMP 3.0/3.1 (2008 – 2011)
●
Concept of tasks added to the OpenMP execution model
➢ OpenMP 4.0 (2013)
●
Accelerator (GPU) support
●
SIMD construct for vectorization of loops
●
Support for Fortran 2003
➢ OpenMP 4.5 (2015)
●
Significant improvements for devices (GPU programming) & Fortran 2003
●
New taskloop construct
➢ OpenMP 5.0 (2018) – compilers do not yet support the full feature set of this standard!
●
Extended memory model to distinguish different types of flush operations
●
Added the target-offload-var ICV and the OMP_TARGET_OFFLOAD environment variable
●
teams construct extended to execute on the host device
19/03/20 IBM 7
OpenMP Programming (1)
●
The OpenMP API uses the fork-join model of parallel execution.
●
The OpenMP API provides a relaxed-consistency, shared-memory model.
●
An OpenMP program begins as a single thread of execution, called an initial thread. An initial thread executes sequentially.
●
When any thread encounters a parallel construct, the thread creates a team of itself and zero or more additional threads and becomes the master of the new team.
19/03/20 IBM 8
OpenMP Programming (2)
●
Compiler directives
– Specified with pragma preprocessing
– Start with #pragma omp (for C/C++)
– Case sensitive
– A directive may be followed by one or more clauses
– Syntax: #pragma omp directive-name [clause[ [,] clause] ... ] new-line
●
Runtime library routines
– The library routines are external functions with "C" linkage
– omp.h provides prototype definitions of all library routines
●
Environment variables
– Set at program start-up
– Specify the settings of the Internal Control Variables (ICVs) that affect the execution of OpenMP programs
– Always in upper case
19/03/20 IBM 9
OpenMP Hello World

helloWorld.c

#include <stdio.h>
#include <omp.h>

int main() {
    int numThreads = 0;

    // call to OpenMP runtime library
    numThreads = omp_get_num_threads();

    // OpenMP directive
    #pragma omp parallel
    {
        // call to OpenMP runtime library
        int threadNum = omp_get_thread_num();
        printf("Hello World from thread %d\n", threadNum);
    }
    return 0;
}

// Setting OpenMP environment variable
export OMP_NUM_THREADS=4
// Compile helloWorld program
xlc -qsmp=omp helloWorld.c -o helloWorld
// Run helloWorld program
./helloWorld

Output
[aditya@hpcwsw7 openmp_tutorial]$ ./helloWorld
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
[aditya@hpcwsw7 openmp_tutorial]$ ./helloWorld
Hello World from thread 0
Hello World from thread 1
Hello World from thread 2
Hello World from thread 3
19/03/20 IBM 10
Compilation of OpenMP program
XL
●
CPU: -qsmp=omp
●
GPU: -qsmp=omp -qoffload -qtgtarch=sm_70 (V100 GPU)
GNU
●
CPU: -fopenmp
●
GPU: -fopenmp -foffload=nvptx-none
NVIDIA CUDA linking
●
-lcudart -L/usr/local/cuda/lib64
19/03/20 IBM 11
OpenMP Directives
●
Parallel construct
●
SIMD construct
●
Combined construct
●
Worksharing construct
●
Master and synchronization construct
●
Tasking construct
●
Device construct
19/03/20 IBM 12
Parallel construct

#pragma omp parallel [clause[ [,] clause] ... ] new-line
{
}

●
The fundamental construct that starts parallel execution
●
A team of threads is created to execute the parallel region; the original thread becomes the master of the new team
●
All threads in the team execute the parallel region
●
Implicit barrier at the end of the parallel region

#pragma omp parallel
{
    int threadNum = omp_get_thread_num();
    printf("Hello World from thread %d\n", threadNum);
}
19/03/20 IBM 13
SIMD construct

#pragma omp simd [clause[ [,] clause] ... ] new-line
{
}

●
Applied on a loop to transform its iterations to execute concurrently using SIMD instructions
●
When any thread encounters a simd construct, the iterations of the loop associated with the construct may be executed concurrently using the SIMD lanes that are available to the thread.

#pragma omp parallel for simd
for (i=0; i<N; i++) {
    a[i] = b[i] * c[i];
}

(Each iteration of each thread is executed concurrently on a vector unit, e.g. b0..b3 * c0..c3.)
19/03/20 IBM 14
Combined construct
●
Combination of more than one construct
– Specifies one construct immediately nested inside another construct
●
Clauses from both constructs are permitted
– With some exceptions, e.g. the nowait clause cannot be specified on parallel for or parallel sections
●
Examples
– #pragma omp parallel for
– #pragma omp parallel for simd
– #pragma omp parallel sections
– #pragma omp target parallel for simd
19/03/20 IBM 15
Worksharing construct
●
Distributes the execution of the associated parallel region among the members of the team
●
Implicit barrier at the end
●
Types
– Loop construct
– Sections construct
– Single construct
– Workshare construct (Fortran only)
19/03/20 IBM 16
Worksharing : Loop construct

#pragma omp for [clause[ [,] clause] ... ] new-line
// for-loop

●
Iterations of the for-loop are distributed among the threads
●
One or more iterations are executed in parallel by each thread
●
Implicit barrier at the end unless nowait is specified

int N = 200;
#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<N; i++) {
        a[i] = b[i] * c[i];
    }
}

(With 5 threads, the iteration space is divided among the threads: (0-39) (40-79) (80-119) (120-159) (160-199).)
19/03/20 IBM 17
Worksharing : Section construct

#pragma omp sections [clause[ [,] clause] ... ] new-line
{
    #pragma omp section new-line
    // structured block
    #pragma omp section new-line
    // structured block
}

●
Non-iterative worksharing construct
●
Each section structured block is executed once
●
The section structured blocks are executed in parallel (one thread per section)
●
Implicit barrier at the end unless nowait is specified

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            // do_work1
        }
        #pragma omp section
        {
            // do_work2
        }
    }
}
19/03/20 IBM 18
Worksharing : Single construct

#pragma omp single [clause[ [,] clause] ... ] new-line
{
}

●
Only one thread in the team executes the associated structured block
●
Implicit barrier at the end; all other threads wait until the single block is finished
●
The nowait clause removes the barrier at the end of the single construct

#pragma omp parallel
{
    // do_some_work()
    #pragma omp single
    {
        // write result to file
    }
    // do_post_work()
}
19/03/20 IBM 19
Tasking construct

#pragma omp task [clause[ [,] clause] ... ] new-line
{
}

●
Defines an explicit task
●
A task is generated and the corresponding data environment is created from the associated structured block
●
Execution of a task may be immediate or deferred for later execution
●
Examples: while loops, recursion, sorting algorithms

#pragma omp parallel
{
    #pragma omp single
    {
        while (ptr != NULL) {
            #pragma omp task
            {
                // do_some_work();
            }
        } // end of while
    } // end of single
} // end of parallel

(A single thread creates the tasks; the task scheduler schedules them onto the available threads.)
19/03/20 IBM 20
Synchronization construct (1)

Critical construct – one thread at a time executes the critical section (e.g. 4 threads execute it sequentially)

#pragma omp parallel shared (sum) private (localSum)
{
    localSum = 0;
    #pragma omp for
    for (i=0; i<N; i++) {
        localSum += a[i];
    }
    #pragma omp critical (global_sum)
    {
        sum += localSum;
    }
}

Master construct – only the master thread executes the section; no entry barrier, no exit barrier

#pragma omp parallel
{
    #pragma omp for
    // do_pre_work()
    #pragma omp master
    {
        // write result to file
    }
    #pragma omp for
    // do_post_work()
}

Ordered construct – iteration order is preserved inside the ordered section

#pragma omp parallel
{
    #pragma omp for ordered
    for (i=0; i<N; i++) {
        // do_some_work
        #pragma omp ordered
        {
            sum += a[i];
        }
        // do_post_work
    }
}
19/03/20 IBM 21
Synchronization construct (2)
●
Barrier construct
– #pragma omp barrier new-line
– Stand-alone directive
– A "must execute" for all threads in the parallel region
– No thread continues until all threads reach the barrier
●
Taskwait construct
– #pragma omp taskwait new-line
– Stand-alone directive
– Waits until all child tasks generated before the taskwait region complete execution
●
Atomic construct
– #pragma omp atomic [atomic-clause] new-line
– Ensures that a specific storage location is accessed atomically
– Enforces atomicity for read, write, update or capture
●
Flush construct
– #pragma omp flush [(list)] new-line
– Stand-alone directive
– Makes a thread's temporary view of memory consistent with main memory
– When a list is specified, the flush operation applies to the items in the list; otherwise to all thread-visible data items.
19/03/20 IBM 22
GPU offloading using OpenMP
19/03/20 IBM 23
Device construct (1)

CPU (no threading):
for (i=0; i<N; i++) {
    a[i] = b[i] + s * c[i];
}

CPU OpenMP:
#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = b[i] + s * c[i];
}

GPU OpenMP:
#pragma omp target
#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = b[i] + s * c[i];
}

Target construct
●
Executes the construct on a device (GPU)
●
Maps variables to the device data environment
Note
●
No thread migration from one device to another.
●
In the absence of a device, or with an unsupported implementation for the target device, all target regions execute on the host.
19/03/20 IBM 24
Device Construct (2)

The host-centric execution model:
1. The thread starts on the host, and the host creates the data environment(s) on the device(s).
2. The host then maps data to the device data environment (data moves to the device).
3. The host offloads target regions to the target devices (code executes on the device).
4. After execution, the host updates the data between the host and the device (data transfers from device to host).
5. The host destroys the data environment on the device.

Teams execute asynchronously; threads within a team synchronise. (CUDA block == Team)
19/03/20 IBM 25
Device construct (3)

Teams construct
●
Creates a league of thread teams
●
The master thread of each team executes the region
●
No implicit barrier at the end of the teams construct

#pragma omp target teams
#pragma omp parallel for
for (i=0; i<N; i++) {
    a[i] = b[i] + s * c[i];
}

Distribute construct
●
Distributes iterations among the master threads of all teams
●
No implicit barrier at the end of the distribute construct
●
Always associated with loops and binds to the teams region

#pragma omp target
#pragma omp teams distribute parallel for
for (i=0; i<N; i++) {
    a[i] = b[i] + s * c[i];
}
19/03/20 IBM 26
Device construct (4)
●
Target data
– #pragma omp target data [clause] new-line {structured block}
– Maps variables to the device data environment; the task encountering the construct executes the region
●
Target enter data / Target exit data
– #pragma omp target enter data [clause] new-line
– #pragma omp target exit data [clause] new-line
– Stand-alone directives
– Variables are mapped/unmapped to/from the device data environment
– Generate a new task which encloses the target data region
  • 27. 19/03/20 IBM 27 Device Data Mapping
● Map clause
  – Maps variables from the host data environment to corresponding variables in the device data environment
  – map([ [map-type-modifier[,]] map-type : ] list)
  – map-type
    ● to (initializes device variables from the corresponding host variables)
    ● from (writes data back from the device to the host)
    ● tofrom – default map type (combination of to & from)
    ● alloc (allocates device memory for variables)
    ● release (decrements the reference count by one)
    ● delete (deletes the device memory of the variables)
  • 28. 19/03/20 IBM 28 Device construct (5)
  • 29. 19/03/20 IBM 29 Device construct (5)
● if clause
  – if (directive-name-modifier : scalar-expression)
  – Used on target data, target enter/exit data and combined constructs
  – On target constructs, if the if clause expression evaluates to false, the target region is executed on the host
● device (n)
  – Specifies the device (GPU) on which the target region is to execute; ‘n’ is the device id.
  – In the absence of a device clause, the target region executes on the default device
  • 30. 19/03/20 IBM 30 OpenMP Clauses (1)
● schedule ([modifier :] kind[, chunk_size])
  – Kind
    ● static: iterations are divided into chunks and assigned to threads in round-robin fashion
    ● dynamic: iterations are divided into chunks and assigned to threads dynamically (on a first-come-first-served basis)
    ● guided: threads grab chunks of iterations, and the chunk size gradually shrinks down to chunk_size as execution proceeds
    ● auto: the compiler or OpenMP runtime decides the optimum schedule from static, dynamic or guided
    ● runtime: the decision is deferred until runtime; schedule type and chunk size are taken from the environment variable
  – Modifier (new addition in the OpenMP 4.5 specification)
    ● simd: the chunk size is adjusted to the SIMD width; works only with the SIMD construct, e.g. schedule(simd:static, 5)
    ● monotonic: chunks are assigned to each thread in increasing logical iteration order
    ● nonmonotonic: the assignment order of chunks to threads is unspecified
  • 31. 19/03/20 IBM 31 OpenMP Clauses (2)
● collapse (n)
  – Collapses two or more loops to create a larger iteration space
  – The parameter n specifies the number of loops associated with the loop construct to be collapsed
#pragma omp parallel for collapse (2)
for (i=0; i<N; i++) {
  for (j=0; j<M; j++) {
    for (k=0; k<M; k++) {
      foo(i, j, k);
}}}
● nowait
  – Used for asynchronous thread movement in worksharing constructs, i.e. threads that finish early proceed to the next instruction without waiting for the other threads
  – Used on loop, sections, single, target and so on
  • 32. 19/03/20 IBM 32 Data Sharing Attribute Clauses (1)
● private (list)
  – Declares variables to be private to a task or thread
  – All references to the original variable are replaced by references to the new variable when the respective directive is encountered
  – Private variables are not initialized by default
● shared (list)
  – Declares variables to be shared by tasks or threads
  – All references to the variable point to the original variable when the respective directive is encountered
● firstprivate (list)
  – Initializes the private variable with the value of the original variable prior to execution of the construct
  – Used with parallel, task, target, teams etc.
● lastprivate (list)
  – Writes the value of the private variable back to the original variable on exiting the respective construct
  – The updated value comes from the sequentially last iteration or section of the worksharing construct
  • 33. 19/03/20 IBM 33 Data Sharing Attribute Clauses (2)
● reduction (reduction-identifier : list)
  – Updates the original variable with the final value of each of the private copies, using the combiner – an operator or identifier
  – The reduction identifier is either an identifier (max, min) or an operator (+, -, *, &, |, ˆ, && and ||)
● threadprivate (list)
  – Makes global variables (file scope, static) private to each thread
  – Each thread keeps its own copy of the global variable
  • 34. 19/03/20 IBM 34 Data Copying Clauses
Copy the value of a private or threadprivate variable from one thread to another
● copyin (list)
  – Allowed on parallel and combined worksharing constructs
  – Copies the value of the master thread's threadprivate variable to the threadprivate variables of the other threads before execution starts
● copyprivate (list)
  – Allowed on the single construct
  – Broadcasts the value of a private variable from one thread to the other threads in the team
  • 35. 19/03/20 IBM 35 OpenMP Runtime library (1)
● Runtime library definitions
  – omp.h declares the prototypes of all OpenMP routines and types.
● Execution environment routines (35+ routines)
  – Routines that affect and monitor threads, processors and the parallel environment
    ● void omp_set_num_threads(int num_threads);
    ● int omp_get_thread_num(void);
    ● int omp_get_num_procs(void);
    ● void omp_set_schedule(omp_sched_t kind, int chunk_size);
● Timing routines
  – Routines that support a portable wall-clock timer
    – double omp_get_wtime(void);
    – double omp_get_wtick(void);
  • 36. 19/03/20 IBM 36 OpenMP Runtime library (2)
● Lock routines (6 routines)
  – Routines for thread synchronization
  – OpenMP locks are represented by OpenMP lock variables
  – 2 types of locks
    ● Simple: can be set only once
    ● Nestable: can be set multiple times by the same task
  – Example APIs
    ● void omp_init_lock(omp_lock_t *lock); void omp_destroy_lock(omp_lock_t *lock);
    ● void omp_set_nest_lock(omp_nest_lock_t *lock); void omp_unset_nest_lock(omp_nest_lock_t *lock);
    ● int omp_test_lock(omp_lock_t *lock);
● Device memory routines (7 routines)
  – Routines that support allocation of memory and management of pointers in the data environments of target devices
    ● void *omp_target_alloc(size_t size, int device_num);
    ● void omp_target_free(void *device_ptr, int device_num);
    ● int omp_target_memcpy(void *dst, void *src, size_t length, size_t dst_offset, size_t src_offset, int dst_device_num, int src_device_num);
  • 37. 19/03/20 IBM 37 OpenMP Environment Variables
1. OMP_SCHEDULE – sets the run-sched-var ICV that specifies the runtime schedule type and chunk size. It can be set to any of the valid OpenMP schedule types.
2. OMP_NUM_THREADS – sets the nthreads-var ICV that specifies the number of threads to use for parallel regions.
3. OMP_DYNAMIC – sets the dyn-var ICV that specifies the dynamic adjustment of the number of threads to use for parallel regions.
4. OMP_PROC_BIND – sets the bind-var ICV that controls the OpenMP thread affinity policy.
5. OMP_PLACES – sets the place-partition-var ICV that defines the OpenMP places that are available to the execution environment.
6. OMP_NESTED – sets the nest-var ICV that enables or disables nested parallelism.
7. OMP_STACKSIZE – sets the stacksize-var ICV that specifies the size of the stack for threads created by the OpenMP implementation.
8. OMP_WAIT_POLICY – sets the wait-policy-var ICV that controls the desired behavior of waiting threads.
9. OMP_MAX_ACTIVE_LEVELS – sets the max-active-levels-var ICV that controls the maximum number of nested active parallel regions.
10. OMP_THREAD_LIMIT – sets the thread-limit-var ICV that controls the maximum number of threads participating in a contention group.
11. OMP_CANCELLATION – sets the cancel-var ICV that enables or disables cancellation.
12. OMP_DISPLAY_ENV – instructs the runtime to display the OpenMP version number and the initial values of the ICVs, once, during initialization of the runtime.
13. OMP_DEFAULT_DEVICE – sets the default-device-var ICV that controls the default device number.
14. OMP_MAX_TASK_PRIORITY – sets the max-task-priority-var ICV that specifies the maximum value that can be specified in the priority clause of the task construct.
  • 38. 19/03/20 IBM 38 Best practices
● Always initialize data (arrays) with parallel code to benefit from the first-touch placement policy on NUMA nodes
● Bind OpenMP threads to cores (see the OpenMP thread affinity policy)
● Watch for false sharing and add padding to arrays if necessary
● Collapse loops to increase the available parallelism
● Always use teams and distribute to expose all available parallelism
● Avoid small parallel blocks on GPUs and unnecessary data transfers between host and device
● Overlap data transfer with compute using asynchronous transfers
  • 40. 19/03/20 IBM Example : matrix multiplication
A x B = C
Matrix multiplication implementation
for(i=0;i<ROWSIZE;i++) {
  for(j=0;j<COLSIZE;j++) {
    for(k=0;k<ROWSIZE;k++) {
      pC(i,j) += pA(i,k) * pB(k,j);
    }
  }
}
  • 41. 19/03/20 IBM 41 Example 1 – GPU offloading This example demonstrates offloading CPU functionality to a single GPU using OpenMP pragmas. The notebook facilitates compilation, running, monitoring and profiling of the example program.
  • 42. 19/03/20 IBM Example : matrix multiplication (domain decomposition to solve in parallel)
Implemented domain decomposition: A x B = C, with the work split across nodes N0–N3 and compute units CU0–CU3 (figure: block decomposition of the matrices over nodes and compute units)
CU – compute unit, N – node
  • 43. 19/03/20 IBM 43 Example 2 – GPU scaling This example demonstrates multi-GPU scaling using OpenMP pragmas. The notebook facilitates compilation, running, monitoring and profiling of the example program.