An Introduction to OpenCL™ Programming with AMD GPUs (AMD & Acceleware Webinar)

An Introduction to OpenCL™ Using AMD GPUs
Chris Mason, Product Manager, Acceleware
September 17, 2014
© 2014 Acceleware Ltd. Reproduction or distribution strictly prohibited.
About Acceleware
Programmer Training
–OpenCL, CUDA, OpenMP
–Over 100 courses taught
–http://acceleware.com/training
Consulting Services
–Completed projects for: Oil & Gas, Medical, Finance, Security & Defence, Computer Aided Engineering, Media & Entertainment
–http://acceleware.com/services
GPU Accelerated Software
–Seismic imaging & modeling
–Electromagnetics
Seismic Imaging & Modeling
AxWave
–Seismic forward modeling
–2D, 3D, constant and variable density models
–High fidelity finite-difference modeling
AxRTM
–High performance Reverse Time Migration Application
–Isotropic, VTI and TTI media
HPC Implementation
–Optimized for GPUs
–Efficient multi-GPU scaling
Electromagnetics
AxFDTD™
–Finite-Difference Time-Domain Electromagnetic Solver
–Optimized for GPUs
–Sub-gridding and large feature coverage
–Multi-GPU, GPU clusters, GPU targeting
–Available from:
Consulting Services
Industry: Finance (Option Pricing)
–Work completed: Debugged & optimized existing code; implemented the Leisen-Reimer version of the binomial model for stock option pricing
–Results: 30-50x performance improvement compared to single-threaded CPU code

Industry: Security & Defense (Detection System)
–Work completed: Replaced legacy Cell-based infrastructure with GPUs; implemented GPU-accelerated X-ray iterative image reconstruction and explosive detection algorithms
–Results: Surpassed the performance targets; reduced hardware cost by a factor of 10

Industry: CAE (SIMULIA Abaqus)
–Work completed: Developed a GPU-accelerated version; conducted a finite-element analysis and developed a library to offload the LDLT factorization portion of the multi-frontal solver to GPUs
–Results: Delivered an accelerated (2-3x) solution that supports NVIDIA and AMD GPUs

Industry: Medical (CT Reconstruction Software)
–Work completed: Developed a GPU-accelerated application for image reconstruction on CT scanners and implemented advanced features including a job batch manager, filtering, and bad-pixel corrections
–Results: Accelerated back projection by 31x

Industry: Oil & Gas (Seismic Application)
–Work completed: Converted MATLAB research code into a standalone application & improved performance via algorithmic optimizations
–Results: 20-30x speedup
Programmer Training
OpenCL, CUDA, OpenMP
Teachers with real world experience
Hands-on lab exercises
Progressive lectures
Small class sizes to maximize learning
90 days post training support
“The level of detail is fantastic. The course did not focus on syntax but rather on how to expertly program for the GPU. I loved the course and I hope that we can get more of our team to take it.”
Jason Gauci, Software Engineer
Lockheed Martin
Outline
Introduction to the OpenCL Architecture
–Contexts, Devices, Queues
Memory and Error Management
Data-Parallel Computing
–Kernel Launches
GPU Kernels
Introduction To The OpenCL Architecture
OpenCL Architecture Introduction and Terminology
Four high level models describe the key OpenCL concepts:
–Platform Model – high level host/device interaction
–Execution Model – OpenCL programs execute on host/device
–Memory Model – different memory resources on device
–Programming Model – types of parallel workloads
OpenCL Platform Model
A host connected to one or more devices
–Example: GPUs, DSPs, FPGAs
A program can work with devices from multiple vendors
A platform is a host and a collection of devices that share resources and execute programs
[Figure: the platform model; a host connected to N devices, e.g. Device 1 (GPU), Device 2 (CPU), …, Device N (GPU)]
OpenCL Execution Model
The host defines a context to control the device
–The context manages the following resources:
–Devices – hardware to run on
–Kernels – functions to run on the hardware
–Program Objects – device executables
–Memory Objects – memory visible to host and device
A command queue schedules commands for execution on the device
OpenCL API - Platform and Runtime Layer
The OpenCL API is divided into two layers: Platform and Runtime
The platform layer allows the host program to discover devices and capabilities
The runtime layer allows the host program to work with contexts once created
Program Set Up
To set up an OpenCL program, the typical steps are as follows:
1. Query and select the platforms (e.g., AMD)
2. Query the devices
3. Create a context
4. Create a command queue
5. Read/write device memory
6. Launch the kernel
The first steps (discovering platforms and devices and creating a context) use the platform layer; the remaining steps use the runtime layer.
Sample Platform Layer C Code
// Get the platform ID
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

// Get the first GPU device associated with the platform
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Create an OpenCL context for the GPU device
cl_context context;
context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
OpenCL Runtime Layer
A command queue operates on contexts, memory, and program objects
Each device can have one or more command queues
Operations in the command queue execute in order unless out-of-order mode is enabled
[Figure: a command queue containing, in order: copy data, copy data, launch kernel, copy data]
Memory and Error Management
OpenCL Buffers
A buffer stores a one-dimensional collection of elements
Buffer objects use the cl_mem type
–cl_mem is an abstract memory container (i.e., a handle)
–The buffer object cannot be dereferenced on the host
•cl_mem a; a[0] = 5; // Not allowed
OpenCL commands interact with buffers
OpenCL Syntax – C Memory Management Example
Example:
// Create an OpenCL command queue
cl_int err;
cl_command_queue queue;
queue = clCreateCommandQueue(context, device, 0, &err);

// Allocate memory on the device
const int N = 5;
int nBytes = N * sizeof(int);
cl_mem a = clCreateBuffer(context, CL_MEM_READ_WRITE,
                          nBytes, NULL, &err);

int hostarr[N] = {3, 1, 4, 1, 5};

// Transfer memory to the device
err = clEnqueueWriteBuffer(queue, a, CL_TRUE, 0,
                           nBytes, hostarr, 0, NULL, NULL);
OpenCL Syntax – Error Management
Host code manages errors:
–Most host side OpenCL function calls return cl_int
•“Create” calls return the object that is created
–Error code is passed by reference as last argument
•Error codes are negative values defined in cl.h
•CL_SUCCESS == 0
OpenCL Syntax – Clean Up
All objects that are created can be released with the following functions:
–clReleaseContext
–clReleaseCommandQueue
–clReleaseMemObject
Data-Parallel Computing
Data-Parallel Computing
Data-parallelism:
1. Operations are performed on a data set organized into a common structure (e.g. an array)
2. Tasks work collectively on the same structure, with each task operating on its own portion
3. Tasks perform identical operations on their portions of the structure; operations on each portion must not be data dependent!
Data Dependence
Data dependence occurs when a program statement refers to the data of a preceding statement.
Data dependence limits parallelism
These three statements are independent:

a = 2 * x;
b = 2 * y;
c = 3 * x;

Here b depends on a, and c depends on b (and therefore on a):

a = 2 * x;
b = 2 * a * a;
c = b * 9;
Data-Parallel Computing Example
Data set consisting of arrays A, B, and C
The same operation is performed on each element: Cx = Ax + Bx
Two tasks each operate on a subset of the arrays; Tasks 0 and 1 are independent. There could be more tasks.
[Figure: arrays A0-A7 and B0-B7 are combined element-wise into C0-C7 by the operation Cx = Ax + Bx; Task 0 handles elements 0-3, Task 1 handles elements 4-7]
The OpenCL Programming Model
OpenCL is a heterogeneous model, including provisions for both host and device
[Figure: the host (CPU, chipset, DRAM) connected over PCIe to a device (a DSP, GPU, or FPGA with its own DRAM)]
The OpenCL Programming Model
Data-parallel portions of an algorithm are executed on the device as kernels
–Kernels are C functions with some restrictions, and a few language extensions
Only one kernel is executed at a time
A kernel is executed by many work-items
–Each work-item executes the same kernel
OpenCL Work-Items
OpenCL work-items are conceptually similar to data-parallel tasks or threads
–Each work-item performs the same operations on a subset of a data structure
–Work-items execute independently
OpenCL work-items are not CPU threads
–OpenCL work-items are extremely lightweight
•Little creation overhead
•Instant context-switching
–Work-items must execute the same kernel
OpenCL Work-Item Hierarchy
OpenCL is designed to execute millions of work-items
Work-items are grouped together into work-groups
–There is a maximum number of work-items per work-group (a hardware limit)
–Query CL_DEVICE_MAX_WORK_GROUP_SIZE via clGetDeviceInfo
•Typically 256-1024
The entire collection of work-items is called the N-Dimensional Range (NDRange)
OpenCL Work-Item Hierarchy
Work-groups and NDRange can be 1D, 2D, or 3D
Dimensions set at launch time
[Figure: a 2D NDRange of work-groups (0,0) through (2,1); each work-group contains a 4x3 grid of work-items (0,0) through (3,2)]
The OpenCL Programming Model
The host launches kernels
The host executes serial code between device kernel launches
–Memory management
–Data exchange to/from device (usually)
–Error handling
[Figure: execution alternates between serial host code and device kernel launches, each launch executing an NDRange of work-groups]
Data-Parallel Computing on GPUs
Data-parallel computing maps well to GPUs:
–Identical operations executed on many data elements in parallel
•Simplified flow control allows increased ratio of compute logic (ALUs) to control logic
[Figure: a CPU die is dominated by control logic and caches (L1/L2/L3) feeding a few ALUs; a GPU die is dominated by many ALUs with comparatively little control logic and cache]
OpenCL API – Launching a Kernel (C)
How to launch a kernel:
// 3D work-group; let the OpenCL runtime determine
// the local work size
size_t const globalWorkSize[3] = {512, 512, 512};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, globalWorkSize, NULL,
                       0, NULL, NULL);

// 2D work-group; specify the local work size
size_t const globalWorkSize[2] = {512, 512};
size_t const localWorkSize[2] = {16, 16};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                       globalWorkSize, localWorkSize,
                       0, NULL, NULL);
GPU Kernels
Writing OpenCL Kernels
Denoted by __kernel function qualifier
–e.g. __kernel void myKernel(__global float* a)
Queued from host, executed on device
A few noteworthy restrictions:
–No access to host memory (in general!)
–Must return void
–No function pointers
–No static variables
–No recursion (no stack)
OpenCL Syntax - Kernels
Kernels have access to built-in functions:
–get_work_dim(): number of dimensions in use
–get_global_id(dim): unique index of a work-item
–get_global_size(dim): number of global work-items
–The argument dim ranges from 0 to 2, depending on the dimensionality of the kernel launch
OpenCL Syntax – Kernels (Continued)
Built-in function listing (continued):
–get_local_id (dim): unique index of the work-item within the work-group
–get_local_size (dim): number of work-items within the work-group
–get_group_id (dim): index of the work-group
–get_num_groups (dim): number of work-groups
–Cannot vary the size of work-groups or work-items during a kernel call
OpenCL Syntax - Kernels
Built-in functions are typically used to determine unique work-item identifiers:
[Figure: a one-dimensional NDRange (get_work_dim() == 1) of three work-groups, each with get_local_size(0) == 5 work-items; get_local_id(0) runs 0-4 within each group, while get_global_id(0) runs 0-14 across the range]

get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)
OpenCL Syntax – Thread Identifiers
Result for each kernel launched with the following execution configuration:
Dimension = 1, global work size = 12, local work size = 4
__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = 7;
}
a: 7 7 7 7 7 7 7 7 7 7 7 7

__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = get_group_id(0);
}
a: 0 0 0 0 1 1 1 1 2 2 2 2

__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = get_local_id(0);
}
a: 0 1 2 3 0 1 2 3 0 1 2 3
Code Example - Kernel
Kernel is executed by N work-items
–Each work-item has a unique ID between 0 and N-1
CPU version:

void inc(float* a, float b, int N)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}

void main()
{
    …
    inc(a, b, N);
}

OpenCL version:

__kernel
void inc(__global float* a, float b)
{
    int i = get_global_id(0);
    a[i] = a[i] + b;
}

void main()
{
    …
    clEnqueueNDRangeKernel(…, …);
}
OpenCL Syntax - Kernels
All C operators are supported
–e.g. +, *, /, ^, >, >>
Many functions from the standard math library are available
–e.g. sin(), cos(), ceil(), fabs()
You can write and call your own non-kernel device functions
–float myDeviceFunction(__global float *a)
–Non-kernel functions cannot be called by the host
Control flow statements too!
–e.g. if(), while(), for()
OpenCL Syntax - Synchronization
Kernel launches are asynchronous
–Control returns to the CPU immediately
–Subsequent commands added to the command queue will wait until the kernel has completed
If you want to synchronize on the host:
–Implicit synchronization via blocking commands
•e.g. clEnqueueReadBuffer() with the blocking argument set to CL_TRUE
–Explicit synchronization via clFinish(queue)
•Blocks on the host until all outstanding OpenCL commands in the given queue are complete
Questions?
OpenCL training courses and consulting services
Acceleware Ltd.
Twitter: @Acceleware
Web: http://acceleware.com/opencl-training
Email: services@acceleware.com
-------------------
Stay in the know about developer news, tools, SDKs, technical presentations, events and future webinars. Connect with AMD Developer Central here:
AMD Developer Central
Twitter: @AMDDevCentral
Web: http://developer.amd.com/
YouTube: https://www.youtube.com/user/AMDDevCentral
Developer Forums: http://devgurus.amd.com/welcome
An Overview of GPU Hardware
What is the GPU?
The GPU is a graphics processing unit
Historically used to offload graphics computations from the CPU
Can be a dedicated video card, integrated on the motherboard, or on the same die as the CPU
–Highest performance will require a dedicated video card
Why use GPUs? Performance!
                        Intel Xeon        AMD Opteron       AMD FirePro        AMD FirePro
                        E5-2697 v2        6386SE            W9100              S10000
                        (Ivy Bridge)      (Bulldozer)       (Volcanic Islands) (Southern Islands)
Processing cores        12                16                2816               3584
Clock frequency         2.7-3.4 GHz*      2.8-3.5 GHz*      930 MHz            825 MHz
Memory bandwidth        59.7 GB/s/socket  59.7 GB/s/socket  320 GB/s           480 GB/s
Peak Gflops** (single)  576 @ 3.0 GHz     410 @ 3.2 GHz     5240               5910
Peak Gflops** (double)  288 @ 3.0 GHz     205 @ 3.2 GHz     2620               1480
Gflops/Watt (single)    4.4               2.9               19                 15.76
Total memory            >>16 GB           >>16 GB           16 GB              6 GB

*Indicates the range of clock frequencies supported via Intel Turbo Boost and AMD Turbo CORE Technology
**At maximum frequency when all cores are executing
GPU Potential Advantages
9x more single-precision floating-point throughput
9x more double-precision floating-point throughput
5x higher memory bandwidth
AMD FirePro W9100 vs. Xeon E5-2697 v2
GPU Disadvantages
Architecture not as flexible as CPU
Must rewrite algorithms and maintain software in GPU languages
Attached to CPU via relatively slow PCIe
–~16 GB/s in each direction for PCIe 3.0 x16
Limited memory (though 6-16GB is reasonable for many applications)
Software Approaches for Acceleration
In order of decreasing programming effort:

Programming languages (e.g. OpenCL)
–Maximum flexibility

OpenACC directives
–Simple programming for heterogeneous systems
–Simple compiler hints/pragmas
–Compiler parallelizes code
–Target a variety of platforms

Libraries ("drop-in" acceleration)
–In-depth GPU knowledge not required
–Highly optimized by GPU experts
–Provide functions used in a broad range of applications (e.g. FFT, BLAS)
An Introduction to OpenCL
OpenCL Overview
Parallel computing architecture standardized by the Khronos Group
OpenCL:
–Is a royalty free standard
–Provides an API to coordinate parallel computation across heterogeneous processors
–Defines a cross-platform programming language
OpenCL Versions
To date there are four different versions of OpenCL
–OpenCL 1.0
–OpenCL 1.1
–OpenCL 1.2
–OpenCL 2.0 (finalized November 2013)
Different versions support different functionality
Vendor support at the time of this webinar:
–AMD: OpenCL 1.2
–Intel: OpenCL 1.2
–NVIDIA: OpenCL 1.1
OpenCL Extensions
Optional functionality is exposed through extensions
–Vendors are not required to support extensions to achieve conformance
–However, extensions are expected to be widely available
Some OpenCL extensions are approved by the OpenCL working group
–These extensions are likely to be promoted to core functionality in future versions of the standard
Multi-vendor and vendor specific extensions do not need approval by the working group