OpenCL - The Open Standard for Heterogeneous Parallel Programming
1. OpenCL – The Open Standard for Heterogeneous Parallel Programming
Dr.-Ing. Achim Basermann
Deutsches Zentrum für Luft- und Raumfahrt e.V.
Department of Distributed Systems and Component Software, Köln-Porz
2. Overview
Motivation
Trends in parallel processing
Role of OpenCL
Khronos' OpenCL
Platform model
Execution model
Memory model
Programming model
Conclusions
Material used: OpenCL survey and specification from http://www.khronos.org/opencl/
3. Motivation: Trends in Parallel Processing, HW
Trend #1: Multicore processor chips
Maintain (or even reduce) frequency while replicating cores
Trend #2: Accelerators (e.g. GPGPUs)
Previously, processors would catch up with accelerator functions in the next generation
Accelerator design expense was not amortized well
New accelerator designs are more likely to maintain their performance advantage
And will maintain an enormous power advantage for target workloads
Trend #2b: Heterogeneous multicore in general (e.g. IBM Cell)
Mixes of powerful cores, smaller cores, and accelerators potentially offer the most efficient nodes
The challenge is harnessing them efficiently
4. Motivation: Programming Issues
Many cores per node, and accelerators/heterogeneity
Future performance gains will come via massive parallelism (not clock speed)
An unwelcome situation for HPC apps!
New programming models are needed to exploit this parallelism
At the system/cluster level:
Message-passing to connect node-level languages, or
Global addressing to make communication implicit?
6. Motivation: OpenCL
New open standard that specifically addresses parallel compute accelerators
Extension to C
Provides data parallel and task parallel models
Facilitates a natural transition from the growing number of CUDA (Compute Unified Device Architecture, NVIDIA) programs
Allows porting of Cell applications to a standard model
Plays well with MPI
Can interoperate with Fortran and OpenMP
10. OpenCL: Heterogeneous Computing
OpenCL sits at the emerging intersection of two worlds:
CPUs: multiple cores driving performance increases; multi-processor programming, e.g. OpenMP
GPUs: increasingly general-purpose data-parallel computing; improving numerical precision; graphics APIs and shading languages
OpenCL – Open Computing Language
Open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors
11. OpenCL Working Group
Diverse industry participation
Processor vendors, system OEMs, middleware vendors, application developers
Many industry-leading experts involved in OpenCL’s design
A healthy diversity of industry perspectives
Apple initially proposed the standard and is very active in the working group, serving as specification editor
Many other companies participate in the OpenCL working group (the original slide shows their logos)
12. OpenCL Timeline
Six months from proposal to released specification
Due to a strong initial proposal and a shared commercial incentive to work quickly
Apple’s Mac OS X Snow Leopard will include OpenCL
Improving speed and responsiveness for a wide spectrum of applications
Multiple OpenCL implementations expected in the next 12 months
On diverse platforms
Timeline:
Before Jun 08: Apple works with AMD, Intel, NVIDIA and others on a draft proposal
Jun 08: Apple proposes the OpenCL working group and contributes the draft specification to Khronos
Jun–Oct 08: The working group develops the draft into a cross-vendor specification
Oct 08: The working group sends the completed draft to the Khronos Board for ratification
Dec 08: Khronos publicly releases OpenCL as a royalty-free specification
May 09: Khronos to release conformance tests to ensure high-quality implementations
13. OpenCL: Part of the Khronos API Ecosystem
OpenCL (heterogeneous parallel computing) is at the center of an emerging visual computing ecosystem that includes 3D graphics, video and image processing on desktop, embedded and mobile systems
It connects the silicon community and the software community: parallel computing and visualization in scientific and consumer applications
Related Khronos APIs (from the original slide's diagram): cross-platform desktop 3D and its ecosystem, 3D asset interchange format, embedded 3D, streaming media and image processing, vector 2D, enhanced audio, surface and synch abstraction, integrated mixed-media stack, mobile OS abstraction
Streamlined APIs for mobile and embedded graphics, media and compute acceleration
Umbrella specifications define coherent acceleration stacks for mobile application portability
14. OpenCL: Platform Model
One Host + one or more Compute Devices
Each Compute Device is composed of one or more Compute Units
Each Compute Unit is further divided into one or more Processing Elements
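To make this hierarchy concrete, here is a minimal host-side sketch (error handling omitted; all calls are standard OpenCL 1.0 API) that walks from the platform to a device and queries its number of compute units. The variable names are illustrative only:

#include <CL/cl.h>
// query the first platform and its first compute device
cl_platform_id cpPlatform;
cl_device_id cdDevice;
cl_uint uiNumComputeUnits;
clGetPlatformIDs(1, &cpPlatform, NULL);
clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_ALL, 1, &cdDevice, NULL);
// each compute device reports its number of compute units; the processing
// elements within a compute unit are not enumerated individually
clGetDeviceInfo(cdDevice, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(uiNumComputeUnits), &uiNumComputeUnits, NULL);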
15. OpenCL: Execution Model
OpenCL Program:
Kernels
Basic unit of executable code — similar to C functions, CUDA kernels, etc.
Data-parallel or task-parallel
Host Program
Collection of compute kernels and internal functions
Analogous to a dynamic library
Kernel Execution
The host program invokes a kernel over an index space called an NDRange
NDRange, “N-Dimensional Range”, can be a 1D, 2D, or 3D space
A single kernel instance at a point in the index space is called a work-item
Work-items have unique global IDs from the index space
Work-items are further grouped into work-groups
Work-groups have a unique work-group ID (CUDA: block IDs)
Work-items have a unique local ID within a work-group (CUDA: thread IDs)
16. OpenCL: Execution Model, Example 2D NDRange
Total number of work-items = Gx * Gy
Size of each work-group = Sx * Sy
Global ID can be computed from work-group ID and local ID
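A minimal kernel sketch of this relation (the identity holds per dimension; shown here for dimension 0; the kernel name and output buffer are illustrative):

__kernel void id_demo(__global uint* out)
{
    // global ID = work-group ID * work-group size + local ID
    size_t gx = get_group_id(0) * get_local_size(0) + get_local_id(0);
    out[gx] = (uint)(gx == get_global_id(0)); // always stores 1
}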
17. OpenCL: Execution Model
Contexts are used to contain and manage the state of the “world”
Kernels are executed in contexts defined and manipulated by the host
Devices
Kernels - OpenCL functions
Program objects - kernel source and executable
Memory objects
Command-queue - coordinates execution of kernels
Kernel execution commands
Memory commands: Transfer or map memory object data
Synchronization commands: Constrain the order of commands
Execution order of commands
Launched in-order
Executed in-order or out-of-order
Events are used to implement appropriate synchronization of execution instances
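As a hedged illustration of event-based synchronization (reusing the queue, kernel, and work-size arrays from the VectorAdd example later in this talk; error handling omitted), the host can block until one particular kernel instance has completed:

cl_event ev;
// enqueue the kernel and obtain an event that tracks its execution
clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 1, NULL,
                       szGlobalWorkSize, szLocalWorkSize, 0, NULL, &ev);
// block the host until the kernel instance identified by ev has completed
clWaitForEvents(1, &ev);
clReleaseEvent(ev);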
18. OpenCL: Memory Model
Shared memory with relaxed consistency (similar to CUDA)
Multiple distinct address spaces (which can be collapsed)
Private memory (per work-item); qualifier: __private; e.g. __private char *px; (CUDA: local memory)
Local memory (per work-group); qualifier: __local (CUDA: shared memory)
Global memory; qualifier: __global; e.g. __global float4 *p; (CUDA: global memory)
Constant memory; qualifier: __constant (CUDA: constant memory)
[Original slide figure: a compute device with processing elements (PEs), per-work-group local memories, a global/constant memory data cache, global memory, and compute device memory]
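A small kernel sketch (illustrative only, not from the talk) touching all four address spaces; the __local scratch buffer would be sized per work-group from the host via clSetKernelArg(kernel, 3, nbytes, NULL):

__kernel void qualifier_demo(__global const float4* in,  // global memory
                             __global float4* out,
                             __constant float* coeff,    // constant memory
                             __local float4* scratch)    // local memory, per work-group
{
    __private size_t gid = get_global_id(0);             // private memory, per work-item
    scratch[get_local_id(0)] = in[gid] * coeff[0];
    barrier(CLK_LOCAL_MEM_FENCE); // make local writes visible within the work-group
    out[gid] = scratch[get_local_id(0)];
}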
19. OpenCL: Memory Consistency
A relaxed consistency memory model
Across work-items (CUDA: threads) no consistency
Within a work-item (CUDA: thread) load/store consistency
Consistency of memory shared between commands is enforced through synchronization
20. OpenCL: Programming Model
Data-Parallel Model
Must be implemented by all OpenCL compute devices
Define N-Dimensional computation domain
Each independent element of execution in an N-dimensional domain is called a work-item
The N-dimensional domain defines the total number of work-items that execute in parallel (= global work size)
Work-items can be grouped together — work-group
Work-items in group can communicate with each other
Can synchronize execution among work-items in group to coordinate memory access
Execute multiple work-groups in parallel
Mapping of global work size to work-group can be implicit or explicit
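The implicit/explicit mapping shows up directly in the enqueue call. A sketch, assuming a queue and kernel already exist (the names are hypothetical):

size_t global[1] = { 10000 };
size_t local[1]  = { 100 };   // explicit: 100 work-items per work-group
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, local, 0, NULL, NULL);
// implicit: pass NULL for the local size and let the implementation
// choose how to divide the global work size into work-groups
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, NULL, 0, NULL, NULL);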
21. OpenCL: Programming Model
Task-Parallel Model
Some compute devices can also execute task-parallel compute kernels
Execute as a single work-item
Users express parallelism by
using vector data types implemented by the device,
enqueuing multiple tasks (compute kernels written in OpenCL), and/or
enqueuing native kernels developed using a programming model orthogonal to OpenCL (e.g. native C/C++ functions).
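A task in this model is simply a kernel enqueued as a single work-item. A minimal sketch using the standard clEnqueueTask call, which is equivalent to clEnqueueNDRangeKernel with a global work size of 1 (the queue and kernel names are hypothetical):

// enqueue two independent tasks; with an out-of-order queue they may run concurrently
clEnqueueTask(queue, taskKernelA, 0, NULL, NULL);
clEnqueueTask(queue, taskKernelB, 0, NULL, NULL);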
22. OpenCL: Programming Model, Synchronization
Work-items in a single work-group (work-group barrier)
Similar to __syncthreads() in CUDA
No mechanism for synchronization between work-groups
Synchronization points between commands in command-queues
Similar to multiple kernels in CUDA but more generalized
Command-queue barrier
Waiting on an event
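A sketch of command-queue-level synchronization (queue and kernel names are hypothetical; error handling omitted); the in-kernel counterpart is barrier(CLK_LOCAL_MEM_FENCE):

// command-queue barrier: all commands enqueued before the barrier must
// complete before any command enqueued after it may begin execution
clEnqueueNDRangeKernel(queue, producerKernel, 1, NULL, global, NULL, 0, NULL, NULL);
clEnqueueBarrier(queue);
clEnqueueNDRangeKernel(queue, consumerKernel, 1, NULL, global, NULL, 0, NULL, NULL);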
23. OpenCL C for Compute Kernels
Derived from ISO C99
A few restrictions: no recursion, no function pointers, no functions from the C99 standard headers, ...
Preprocessing directives defined by C99 are supported
Built-in Data Types
Scalar and vector data types, Pointers
Data-type conversion functions: convert_type<_sat><_roundingmode>
Image types: image2d_t, image3d_t and sampler_t
Built-in Functions — Required
Work-item functions, math.h, read and write image
Relational, geometric functions, synchronization functions
Built-in Functions — Optional
double precision (latest CUDA supports this), atomics to global and local memory
selection of rounding mode, writes to image3d_t surface
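A kernel sketch (illustrative only) combining a vector type with a saturating conversion from the convert_type<_sat><_roundingmode> family:

__kernel void saturate_demo(__global const float4* in, __global uchar4* out)
{
    size_t i = get_global_id(0);
    float4 v = in[i] * 255.0f;        // component-wise vector arithmetic
    // saturating conversion: values outside [0,255] are clamped,
    // _rte rounds to nearest even
    out[i] = convert_uchar4_sat_rte(v);
}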
24. OpenCL C Language Highlights
Function qualifiers
“__kernel” qualifier declares a function as a kernel
Kernels can call other kernel functions
Address space qualifiers
__global, __local, __constant, __private
Pointer kernel arguments must be declared with an address space qualifier
Work-item functions
Query work-item identifiers
get_work_dim()
get_global_id(), get_local_id(), get_group_id()
Image functions
Images must be accessed through built-in functions
Reads/writes performed through sampler objects from host or defined in source
Synchronization functions
Barriers - all work-items within a work-group must execute the barrier function before any work-item can continue
Memory fences - provide ordering between memory operations
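A sketch of the image access path: a sampler defined in the program source plus the built-in read/write functions (kernel and variable names are illustrative):

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void copy_image(__read_only image2d_t src, __write_only image2d_t dst)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1)); // one work-item per pixel
    float4 px = read_imagef(src, smp, pos); // reads go through the sampler
    write_imagef(dst, pos, px);             // writes take integer coordinates directly
}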
25. OpenCL C Language Restrictions
Pointers to functions not allowed
Pointers to pointers allowed within a kernel, but not as an argument
Bit-fields not supported
Variable-length arrays and structures not supported
Recursion not supported
Writes to a pointer to types smaller than 32 bits not supported
Double types not supported, but reserved
3D Image writes not supported
Some restrictions are addressed through extensions
27. OpenCL: Kernel Code Example
Simple element-by-element vector addition: for all i, c(i) = a(i) + b(i)

__kernel void VectorAdd(__global const float* a,
                        __global const float* b,
                        __global float* c)
{
    int iGID = get_global_id(0);
    c[iGID] = a[iGID] + b[iGID];
}
28. OpenCL VectorAdd: Contexts and Queues
cl_context cxMainContext; // OpenCL context
cl_command_queue cqCommandQue; // OpenCL command queue
cl_device_id* cdDevices; // OpenCL device list
size_t szParmDataBytes; // byte length of parameter storage
// create the OpenCL context on a GPU device
cxMainContext = clCreateContextFromType (0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
// get the list of GPU devices associated with context
clGetContextInfo (cxMainContext, CL_CONTEXT_DEVICES, 0, NULL, &szParmDataBytes);
cdDevices = (cl_device_id*)malloc(szParmDataBytes);
clGetContextInfo (cxMainContext, CL_CONTEXT_DEVICES, szParmDataBytes, cdDevices, NULL);
// create a command-queue
cqCommandQue = clCreateCommandQueue (cxMainContext, cdDevices[0], 0, NULL);
29. OpenCL VectorAdd: Create Memory Objects,
Program and Kernel
Create Memory Objects
// allocate the first source buffer memory object… source data, so read only
…
// allocate the second source buffer memory object … source data, so read only
…
// allocate the destination buffer memory object … result data, so write only
…
Create Program and Kernel
// create the program
…
// build the program
…
// create the kernel
...
// set the kernel Argument values
…
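The slide elides the actual calls; a minimal sketch of what each step might look like in OpenCL 1.0 (cSourceCL and szKernelLength are assumed to hold the kernel source string and its length, iTestN is the vector length from the next slide; error handling omitted):

// allocate the two read-only source buffers and the write-only destination buffer
cl_mem srcA = clCreateBuffer(cxMainContext, CL_MEM_READ_ONLY, sizeof(cl_float) * iTestN, NULL, NULL);
cl_mem srcB = clCreateBuffer(cxMainContext, CL_MEM_READ_ONLY, sizeof(cl_float) * iTestN, NULL, NULL);
cl_mem dst  = clCreateBuffer(cxMainContext, CL_MEM_WRITE_ONLY, sizeof(cl_float) * iTestN, NULL, NULL);
// create and build the program, then create the kernel
cl_program cpProgram = clCreateProgramWithSource(cxMainContext, 1,
                           (const char**)&cSourceCL, &szKernelLength, NULL);
clBuildProgram(cpProgram, 0, NULL, NULL, NULL, NULL);
cl_kernel ckKernel = clCreateKernel(cpProgram, "VectorAdd", NULL);
// set the kernel argument values
clSetKernelArg(ckKernel, 0, sizeof(cl_mem), (void*)&srcA);
clSetKernelArg(ckKernel, 1, sizeof(cl_mem), (void*)&srcB);
clSetKernelArg(ckKernel, 2, sizeof(cl_mem), (void*)&dst);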
30. OpenCL VectorAdd: Launch Kernel
cl_command_queue cqCommandQue; // OpenCL command queue
cl_kernel ckKernel; // OpenCL kernel “VectorAdd”
size_t szGlobalWorkSize[1]; // global number of work-items
size_t szLocalWorkSize[1]; // number of work-items per work-group
cl_int ciErrNum; // OpenCL error code
int iTestN = 10000; // length of demo test vectors
// set work-item dimensions
szGlobalWorkSize[0] = iTestN;
szLocalWorkSize[0]= 1;
// execute kernel
ciErrNum = clEnqueueNDRangeKernel (cqCommandQue, ckKernel, 1, NULL,
szGlobalWorkSize, szLocalWorkSize,
0, NULL, NULL);
// Cleanup: release kernel, program, and memory objects
…
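For completeness, a hedged sketch of reading the result back and of the elided cleanup (pfResult is an assumed host array of iTestN floats; buffer and object names follow the sketch above):

// blocking read: copy the result buffer back to host memory
clEnqueueReadBuffer(cqCommandQue, dst, CL_TRUE, 0,
                    sizeof(cl_float) * iTestN, pfResult, 0, NULL, NULL);
// release kernel, program, memory objects, queue and context
clReleaseKernel(ckKernel);
clReleaseProgram(cpProgram);
clReleaseMemObject(srcA); clReleaseMemObject(srcB); clReleaseMemObject(dst);
clReleaseCommandQueue(cqCommandQue);
clReleaseContext(cxMainContext);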
31. Summary: OpenCL versus CUDA

                     OpenCL                               CUDA
Execution model      Work-groups / work-items             Blocks / threads
Memory model         Global / constant / local / private  Global / constant / shared / local, plus texture
Memory consistency   Weak consistency                     Weak consistency
Synchronization      Work-group barrier                   __syncthreads()
                     (between work-items)                 (between threads)
32. Conclusions
OpenCL meets the trends in HPC
Supports multicore SMPs
Supports accelerators (in particular GPGPUs)
Allows programming of heterogeneous processors
Supports vectorization (SIMD operations)
Explicitly supports parallel image processing
Suitable for massive parallelism: OpenCL + MPI
OpenCL is a low-level parallel language and complicated to use.
If you master it, you master (heterogeneous) parallel programming.
It might become the basis for new high-level parallel languages.
34. Next TechTalk on CUDA: Wednesday, July 22, 2009
(15:00-16:00, Room KP-2b-06, Funktional)
CUDA - The Compute Unified Device Architecture
from NVIDIA
Jens Rühmkorf