Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
2. Objective
! software has a long life-span that exceeds the life-span of hardware
! software is very expensive to write and maintain
! next generation hardware also needs to run legacy software
! Example: IWAVE
! procedural C-code
! no object orientation
! tight integration between data structures and functions
! What do I mean by efficient scheduling?
! find ways to utilize GPU cores for code blocks
! find ways to utilize all CPU cores and GPU units at the same time
| OpenCL and OpenMP Workloads on Accelerated Processing Units |
3. Historical Context
GPU Compute Timeline
[Figure: timeline with marks at 2002, 2008, 2010, and 2012; labeled technologies: CUDA, Aparapi, AMP C++]
4. Accelerator Challenges
Technology Accessibility and Performance
[Figure: Performance versus Ease-of-Use for CPU Single Thread, CPU Multithread, and OpenCL & CUDA]
5. APU Opportunities
One Die - Two Computational Devices
Metric                 CPU                     APU
Memory Size            large                   small
Memory Bandwidth       small                   large
Parallelism            small                   large
General Purpose        yes                     no
Performance            application dependent   application dependent
Performance-per-Watt   application dependent   application dependent
Programming            Traditional             OpenCL
6. APU Opportunities
Performance and Performance-per-Watt
! Example: Luxmark OpenCL Benchmark
! GPU has best performance-per-Watt
! Best performance by using the APU
! Similar CPU and GPU performance
! APU provides outstanding value
Metric               CPU    GPU    APU
Performance [Pts]    170    197    316
Power [W]             50     37     58
PPW [Pts/W]          3.4    5.3    5.4
Combined [Pts²/W]    578   1049   1722
Luxmark OpenCL Benchmark
Ubuntu 12.10 x86_64
4 Piledriver CPU cores @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
7. Example: Luxmark Renderer
Performance and Performance-per-Watt
[Figure: Luxmark performance and performance-per-Watt chart with annotated gains of +64% and +81%]
Luxmark OpenCL Benchmark
Render “Sala” Scene
Ubuntu 12.10 x86_64
4 Piledriver cores @ 2.5GHz
6 GPU CUs @ 720MHz
16GB DDR3 1600MHz
8. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! Know the problem you are trying to solve.
! staggered rectangular grid in 3D
! coupled first order PDE
! scalar pressure field p
! vector velocity field v = {vx, vy, vz}
! source term g
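For reference, a coupled first-order system for a scalar pressure field and a vector velocity field is commonly written in the pressure-velocity form below. This is a sketch under assumptions: the bulk modulus \kappa and the density \rho do not appear on the slide, which names only p, v, and g, and sign conventions differ between codes.

```latex
\begin{aligned}
\frac{\partial p}{\partial t} &= -\kappa \, \nabla \cdot \mathbf{v} + g, \\
\rho \, \frac{\partial \mathbf{v}}{\partial t} &= -\nabla p,
\qquad \mathbf{v} = (v_x, v_y, v_z).
\end{aligned}
```

In this form the pressure update needs all three velocity components, while each velocity component needs only the pressure.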
9. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
while(…) {                                  // main simulation loop
  sgn_ts3d_210_p012_OpenMP(dom, pars);      // calculate pressure field
  sgn_ts3d_210_v0_OpenMP(dom, pars);        // calculate velocity x-axis
  sgn_ts3d_210_v1_OpenMP(dom, pars);        // calculate velocity y-axis
  sgn_ts3d_210_v2_OpenMP(dom, pars);        // calculate velocity z-axis
  …
}
[Figure: OpenMP timeline over Time: OpenMP p, OpenMP vx, OpenMP vy, OpenMP vz]
10. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! Measure the initial performance.
! pressure and velocity field simulated using OpenMP
! average time T[ms] per iteration
! OpenMP linear scaling with threads
11. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! find computational blocks
! understand dependencies between blocks
! identify sequential and parallel parts
[Figure: causality diagram and OpenMP timeline over Time for the blocks p, vx, vy, vz]
12. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
while(…) {                                  // main simulation loop
  sgn_ts3d_210_p012_OpenMP(dom, pars);      // calculate pressure field p
  sgn_ts3d_210_v0_OpenCL(dom, pars);        // calculate velocity x-axis
  sgn_ts3d_210_v1_OpenMP(dom, pars);        // calculate velocity y-axis
  sgn_ts3d_210_v2_OpenMP(dom, pars);        // calculate velocity z-axis
  …
}
[Figure: timeline over Time: OpenMP p, OpenCL vx (OpenMP IDLE), OpenMP vy, OpenMP vz]
13. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! use the GPU to compute vx
! the CPU is idle while the GPU is running
! 42% improvement for 1 thread
! 25% improvement for 2 threads
! 9% improvement for 4 threads
14. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
while(…) {                                              // main simulation loop
  sgn_ts3d_210_p012_OpenMP(dom, pars);                  // calculate pressure field p

  const char* env = getenv("OMP_NUM_THREADS");          // may be unset
  int num_threads = env ? atoi(env)                     // save the current number of OpenMP threads
                        : omp_get_max_threads();        // fall back if OMP_NUM_THREADS is unset
  omp_set_num_threads(2);                               // restrict the number of OpenMP threads to 2
  omp_set_nested(1);                                    // allow nested OpenMP threads
  #pragma omp parallel shared(…) private(…)             // start 2 OpenMP threads
  {
    switch ( omp_get_thread_num() ) {
      case 0:
        sgn_ts3d_210_v0_OpenCL(dom, pars);              // calculate velocity x-axis using OpenCL
        break;
      case 1:
        omp_set_num_threads(num_threads);               // increase number of OpenMP threads back
        sgn_ts3d_210_v1_OpenMP(dom, pars);              // calculate velocity y-axis
        sgn_ts3d_210_v2_OpenMP(dom, pars);              // calculate velocity z-axis
        break;
      default:
        break;
    }
  }                                                     // close OpenMP pragma
}                                                       // close simulation while
[Figure: timeline over Time: OpenMP p, then OpenCL vx alongside OpenMP vy and OpenMP vz]
15. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! overlap vx and vy
! CPU not idle anymore
! 50% improvement for 1 thread
! 40% improvement for 2 threads
! 38% improvement for 4 threads
16. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
while(…) {                                  // main simulation loop
  sgn_ts3d_210_p012_OpenCL(dom, pars);      // calculate pressure field
  sgn_ts3d_210_v0_OpenCL(dom, pars);        // calculate velocity x-axis
  sgn_ts3d_210_v1_OpenCL(dom, pars);        // calculate velocity y-axis
  sgn_ts3d_210_v2_OpenCL(dom, pars);        // calculate velocity z-axis
  …
}
bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
  …
  clEnqueueWriteBuffer(queue, buffer, …);                 // copy data from host to device
  clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel on device
  clEnqueueReadBuffer(queue, buffer, …);                  // copy data from device to host
  …
}
[Figure: OpenCL timeline over Time: OpenCL p, OpenCL vx, OpenCL vy, OpenCL vz]
17. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! understand where performance gets lost
! 98% of time spent on I/O
! 2% of time spent on compute
! reduce I/O
OpenCL Upload: 188ms
Kernel Execution: 4ms
OpenCL Download: 54ms
[Figure: timeline over Time: OpenCL vx, OpenMP p, OpenMP vy, OpenMP vz]
18. Programming Strategies
Example: High Throughput Computer Vision with OpenCV
! How does the speedup of an OpenCL application (S_{OpenCL}) depend on the speedup of the OpenCL kernel (S_{Kernel}) when the OpenCL I/O time is fixed?
! Fraction of OpenCL I/O time: F_{I/O}
! 50% I/O time limits the maximal possible speedup to 2
! Minimize OpenCL I/O, only then increase OpenCL kernel performance
S_{OpenCL} = \frac{S_{Kernel}}{(S_{Kernel} - 1) F_{I/O} + 1}
19. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
while(…) {                                  // main simulation loop
  sgn_ts3d_210_ALL_OpenCL(dom, pars);       // combine all OpenCL calculations
  …
}
bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
  …
  clEnqueueWriteBuffer(queue, buffer, …);                   // copy data from host to device

  while(…) {
    clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel for pressure
    clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);      // execute OpenCL kernel for velocity x
    clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);      // execute OpenCL kernel for velocity y
    clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);      // execute OpenCL kernel for velocity z
  }
  clEnqueueReadBuffer(queue, buffer, …);                    // copy data from device to host
  …
}
[Figure: OpenCL timeline over Time: OpenCL p, OpenCL vx, OpenCL vy, OpenCL vz]
20. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! eliminate all but essential I/O
! significant speedup over simple OpenCL
21. Programming Strategies
Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! measure real application performance
! 3000 iterations using a 97x405x389 simulation grid
! 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads
[Figure: bar chart, axis from 0 to 14, comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000]
22. Programming Strategies
Example: High Throughput Computer Vision with OpenCV
! initial OpenCL performance measurements
! 89 Algorithms tested for image size of 4MP
! compare OpenCL I/O and execution time
! 28% of all algorithms are compute bound
! 72% of all algorithms are I/O bound
OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
23. Programming Strategies
Example: High Throughput Computer Vision with OpenCV
! compare OpenCL and single-threaded performance
! 89 Algorithms tested for image size of 4MP
! realistic timing that includes I/O over PCIe
! 59% of all algorithms execute faster on the GPU
! 41% of all algorithms execute faster on the CPU(1)
! significant speedup for only 15% of all algorithms
OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
24. Programming Strategies
Example: High Throughput Computer Vision with OpenCV
! Task: Batch process a large amount of images using a single algorithm.
! OpenCL performance is algorithm and image size dependent
! Either the CPU or the GPU processes the data, but not both
! How to choose which algorithm and device to use depending on image size?
26. Programming Strategies
Example: High Throughput Computer Vision with OpenCV
! Better: create an input image queue that the CPU and the GPU query for new image tasks until the queue is empty.
! all CPU cores are fully utilized at all times even for single-threaded algorithms
! all GPU compute units are fully utilized at all times
! combined performance for single algorithm is sum of GPU and CPU performance for that algorithm
! combined performance for multiple algorithms is better than sum of device performance
P^{i}_{APU} = P^{i}_{CPU} + P^{i}_{GPU}
P = \frac{1}{\sum_{i=1}^{N} \frac{1}{P_i}}
28. Programming Strategies
Summary
! next generation hardware and legacy code require compromises
! OpenCL performance is tied to Amdahl’s Law regarding OpenCL I/O and OpenCL execution time
! application performance can be increased by overlapping OpenCL and OpenMP workloads
! removing all but necessary OpenCL I/O can have a dramatic influence on performance
! for loosely coupled high-throughput applications the OpenCL and OpenMP performances add up for single algorithms
! for multiple algorithms the combined performance across all algorithms is better than the sum of the device performances
! APUs may provide greatest performance per Watt
! GPUs may provide greatest performance