OpenCL - The Open Standard for Heterogeneous Parallel Programming
1. OpenCL – The Open Standard for Heterogeneous Parallel Programming
Dr.-Ing. Achim Basermann
Deutsches Zentrum für Luft- und Raumfahrt e.V.
Department of Distributed Systems and Component Software, Köln-Porz
2. Overview
Motivation
Trends in parallel processing
Role of OpenCL
Khronos' OpenCL
Platform model
Execution model
Memory model
Programming model
Conclusions
Material used: OpenCL survey and specification from http://www.khronos.org/opencl/
3. Motivation: Trends in Parallel Processing, HW
Trend #1: Multicore processor chips
Maintain (or even reduce) frequency while replicating cores
Trend #2: Accelerators (e.g. GPGPUs)
Previously, processors would catch up with accelerator functions in the next generation
Accelerator design expense was not amortized well
New accelerator designs are more likely to maintain their performance advantage
And will maintain an enormous power advantage for target workloads
Trend #2b: Heterogeneous multicore in general (e.g. IBM Cell)
Mixes of powerful cores, smaller cores, and accelerators potentially offer the most efficient nodes
The challenge is harnessing them efficiently
4. Motivation: Programming Issues
Many cores per node, and accelerators/heterogeneity
Future performance gains will come via massive parallelism (not clock speed)
An unwelcome situation for HPC apps!
New programming models are needed to exploit this parallelism
At the system/cluster level:
Message-passing to connect node-level languages, or
Global addressing to make communication implicit?
6. Motivation: OpenCL
New open standard that specifically addresses parallel compute accelerators
Extension to C
Provides data parallel and task parallel models
Facilitates a natural transition from the growing number of CUDA (Compute Unified Device Architecture, NVIDIA) programs
Allows porting of Cell applications to a standard model
Plays well with MPI
Can interoperate with Fortran and OpenMP
10. OpenCL: Heterogeneous Computing
OpenCL sits at the emerging intersection of two worlds:
CPUs: multiple cores driving performance increases; multi-processor programming, e.g. OpenMP
GPUs: increasingly general-purpose data-parallel computing; improving numerical precision; graphics APIs and shading languages
OpenCL – Open Computing Language
Open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors
11. OpenCL Working Group
Diverse industry participation
Processor vendors, system OEMs, middleware vendors, application developers
Many industry-leading experts involved in OpenCL’s design
A healthy diversity of industry perspectives
Apple initially proposed the standard and is very active in the working group, serving as specification editor
Many other companies participate in the OpenCL working group (the original slide shows their logos)
12. OpenCL Timeline
Six months from proposal to released specification
Due to a strong initial proposal and a shared commercial incentive to work quickly
Apple’s Mac OS X Snow Leopard will include OpenCL
Improving speed and responsiveness for a wide spectrum of applications
Multiple OpenCL implementations expected in the next 12 months
On diverse platforms
Timeline:
Before Jun 08: Apple works with AMD, Intel, NVIDIA and others on a draft proposal
Jun 08: Apple proposes the OpenCL working group and contributes the draft specification to Khronos
Jun–Oct 08: The working group develops the draft into a cross-vendor specification
Oct 08: The working group sends the completed draft to the Khronos Board for ratification
Dec 08: Khronos publicly releases OpenCL as a royalty-free specification
May 09: Khronos to release conformance tests to ensure high-quality implementations
13. OpenCL: Part of the Khronos API Ecosystem
OpenCL (heterogeneous parallel computing) is at the center of an emerging visual computing ecosystem that includes 3D graphics, video and image processing on desktop, embedded and mobile systems
It connects the silicon community and the software community: parallel computing and visualization in scientific and consumer applications
Related Khronos APIs (from the original slide's diagram): cross-platform desktop 3D and its ecosystem, 3D asset interchange format, embedded 3D, streaming media and image processing, vector 2D, enhanced audio, surface and synch abstraction, integrated mixed-media stack, mobile OS abstraction
Streamlined APIs for mobile and embedded graphics, media and compute acceleration
Umbrella specifications define coherent acceleration stacks for mobile application portability
14. OpenCL: Platform Model
One Host + one or more Compute Devices
Each Compute Device is composed of one or more Compute Units
Each Compute Unit is further divided into one or more Processing Elements
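To make this hierarchy concrete, here is a minimal host-side sketch (error handling omitted; all calls are standard OpenCL 1.0 API) that walks from the platform to a device and queries its number of compute units. The variable names are illustrative only:

#include <CL/cl.h>
// query the first platform and its first compute device
cl_platform_id cpPlatform;
cl_device_id cdDevice;
cl_uint uiNumComputeUnits;
clGetPlatformIDs(1, &cpPlatform, NULL);
clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_ALL, 1, &cdDevice, NULL);
// each compute device reports its number of compute units; the processing
// elements within a compute unit are not enumerated individually
clGetDeviceInfo(cdDevice, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(uiNumComputeUnits), &uiNumComputeUnits, NULL);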
15. OpenCL: Execution Model
OpenCL Program:
Kernels
Basic unit of executable code — similar to C functions, CUDA kernels, etc.
Data-parallel or task-parallel
Host Program
Collection of compute kernels and internal functions
Analogous to a dynamic library
Kernel Execution
The host program invokes a kernel over an index space called an NDRange
NDRange, “N-Dimensional Range”, can be a 1D, 2D, or 3D space
A single kernel instance at a point in the index space is called a work-item
Work-items have unique global IDs from the index space
Work-items are further grouped into work-groups
Work-groups have a unique work-group ID (CUDA: block IDs)
Work-items have a unique local ID within a work-group (CUDA: thread IDs)
16. OpenCL: Execution Model, Example 2D NDRange
Total number of work-items = Gx * Gy
Size of each work-group = Sx * Sy
Global ID can be computed from work-group ID and local ID
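A minimal kernel sketch of this relation (the identity holds per dimension; shown here for dimension 0; the kernel name and output buffer are illustrative):

__kernel void id_demo(__global uint* out)
{
    // global ID = work-group ID * work-group size + local ID
    size_t gx = get_group_id(0) * get_local_size(0) + get_local_id(0);
    out[gx] = (uint)(gx == get_global_id(0)); // always stores 1
}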
17. OpenCL: Execution Model
Contexts are used to contain and manage the state of the “world”
Kernels are executed in contexts defined and manipulated by the host
Devices
Kernels - OpenCL functions
Program objects - kernel source and executable
Memory objects
Command-queue - coordinates execution of kernels
Kernel execution commands
Memory commands: Transfer or map memory object data
Synchronization commands: Constrain the order of commands
Execution order of commands
Launched in-order
Executed in-order or out-of-order
Events are used to implement appropriate synchronization of execution instances
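As a hedged illustration of event-based synchronization (reusing the queue, kernel, and work-size arrays from the VectorAdd example later in this talk; error handling omitted), the host can block until one particular kernel instance has completed:

cl_event ev;
// enqueue the kernel and obtain an event that tracks its execution
clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 1, NULL,
                       szGlobalWorkSize, szLocalWorkSize, 0, NULL, &ev);
// block the host until the kernel instance identified by ev has completed
clWaitForEvents(1, &ev);
clReleaseEvent(ev);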
18. OpenCL: Memory Model
Shared memory with relaxed consistency (similar to CUDA)
Multiple distinct address spaces (which can be collapsed)
Private memory (per work-item); qualifier: __private; e.g. __private char *px; (CUDA: local memory)
Local memory (per work-group); qualifier: __local (CUDA: shared memory)
Global memory; qualifier: __global; e.g. __global float4 *p; (CUDA: global memory)
Constant memory; qualifier: __constant (CUDA: constant memory)
[Original slide figure: a compute device with processing elements (PEs), per-work-group local memories, a global/constant memory data cache, global memory, and compute device memory]
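A small kernel sketch (illustrative only, not from the talk) touching all four address spaces; the __local scratch buffer would be sized per work-group from the host via clSetKernelArg(kernel, 3, nbytes, NULL):

__kernel void qualifier_demo(__global const float4* in,  // global memory
                             __global float4* out,
                             __constant float* coeff,    // constant memory
                             __local float4* scratch)    // local memory, per work-group
{
    __private size_t gid = get_global_id(0);             // private memory, per work-item
    scratch[get_local_id(0)] = in[gid] * coeff[0];
    barrier(CLK_LOCAL_MEM_FENCE); // make local writes visible within the work-group
    out[gid] = scratch[get_local_id(0)];
}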
19. OpenCL: Memory Consistency
A relaxed consistency memory model
Across work-items (CUDA: threads) no consistency
Within a work-item (CUDA: thread) load/store consistency
Consistency of memory shared between commands is enforced through synchronization
20. OpenCL: Programming Model
Data-Parallel Model
Must be implemented by all OpenCL compute devices
Define N-Dimensional computation domain
Each independent element of execution in an N-dimensional domain is called a work-item
The N-dimensional domain defines the total number of work-items that execute in parallel (= global work size)
Work-items can be grouped together — work-group
Work-items in group can communicate with each other
Can synchronize execution among work-items in group to coordinate memory access
Execute multiple work-groups in parallel
Mapping of global work size to work-group can be implicit or explicit
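The implicit/explicit mapping shows up directly in the enqueue call. A sketch, assuming a queue and kernel already exist (the names are hypothetical):

size_t global[1] = { 10000 };
size_t local[1]  = { 100 };   // explicit: 100 work-items per work-group
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, local, 0, NULL, NULL);
// implicit: pass NULL for the local size and let the implementation
// choose how to divide the global work size into work-groups
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, NULL, 0, NULL, NULL);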
21. OpenCL: Programming Model
Task-Parallel Model
Some compute devices can also execute task-parallel compute kernels
Execute as a single work-item
Users express parallelism by
using vector data types implemented by the device,
enqueuing multiple tasks (compute kernels written in OpenCL), and/or
enqueuing native kernels developed using a programming model orthogonal to OpenCL (e.g. native C/C++ functions).
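A task in this model is simply a kernel enqueued as a single work-item. A minimal sketch using the standard clEnqueueTask call, which is equivalent to clEnqueueNDRangeKernel with a global work size of 1 (the queue and kernel names are hypothetical):

// enqueue two independent tasks; with an out-of-order queue they may run concurrently
clEnqueueTask(queue, taskKernelA, 0, NULL, NULL);
clEnqueueTask(queue, taskKernelB, 0, NULL, NULL);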
22. OpenCL: Programming Model, Synchronization
Work-items in a single work-group (work-group barrier)
Similar to __syncthreads() in CUDA
No mechanism for synchronization between work-groups
Synchronization points between commands in command-queues
Similar to multiple kernels in CUDA but more generalized
Command-queue barrier
Waiting on an event
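A sketch of command-queue-level synchronization (queue and kernel names are hypothetical; error handling omitted); the in-kernel counterpart is barrier(CLK_LOCAL_MEM_FENCE):

// command-queue barrier: all commands enqueued before the barrier must
// complete before any command enqueued after it may begin execution
clEnqueueNDRangeKernel(queue, producerKernel, 1, NULL, global, NULL, 0, NULL, NULL);
clEnqueueBarrier(queue);
clEnqueueNDRangeKernel(queue, consumerKernel, 1, NULL, global, NULL, 0, NULL, NULL);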
23. OpenCL C for Compute Kernels
Derived from ISO C99
A few restrictions: no recursion, no function pointers, no functions from the C99 standard headers, ...
Preprocessing directives defined by C99 are supported
Built-in Data Types
Scalar and vector data types, Pointers
Data-type conversion functions: convert_type<_sat><_roundingmode>
Image types: image2d_t, image3d_t and sampler_t
Built-in Functions — Required
Work-item functions, math.h, read and write image
Relational, geometric functions, synchronization functions
Built-in Functions — Optional
double precision (latest CUDA supports this), atomics to global and local memory
selection of rounding mode, writes to image3d_t surface
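A kernel sketch (illustrative only) combining a vector type with a saturating conversion from the convert_type<_sat><_roundingmode> family:

__kernel void saturate_demo(__global const float4* in, __global uchar4* out)
{
    size_t i = get_global_id(0);
    float4 v = in[i] * 255.0f;        // component-wise vector arithmetic
    // saturating conversion: values outside [0,255] are clamped,
    // _rte rounds to nearest even
    out[i] = convert_uchar4_sat_rte(v);
}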
24. OpenCL C Language Highlights
Function qualifiers
“__kernel” qualifier declares a function as a kernel
Kernels can call other kernel functions
Address space qualifiers
__global, __local, __constant, __private
Pointer kernel arguments must be declared with an address space qualifier
Work-item functions
Query work-item identifiers
get_work_dim()
get_global_id(), get_local_id(), get_group_id()
Image functions
Images must be accessed through built-in functions
Reads/writes performed through sampler objects from host or defined in source
Synchronization functions
Barriers - all work-items within a work-group must execute the barrier function before any work-item can continue
Memory fences - provide ordering between memory operations
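A sketch of the image access path: a sampler defined in the program source plus the built-in read/write functions (kernel and variable names are illustrative):

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void copy_image(__read_only image2d_t src, __write_only image2d_t dst)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1)); // one work-item per pixel
    float4 px = read_imagef(src, smp, pos); // reads go through the sampler
    write_imagef(dst, pos, px);             // writes take integer coordinates directly
}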
25. OpenCL C Language Restrictions
Pointers to functions not allowed
Pointers to pointers allowed within a kernel, but not as an argument
Bit-fields not supported
Variable-length arrays and structures not supported
Recursion not supported
Writes to a pointer to types smaller than 32 bits not supported
Double types not supported, but reserved
3D Image writes not supported
Some restrictions are addressed through extensions
27. OpenCL: Kernel Code Example
Simple element-by-element vector addition: for all i, c(i) = a(i) + b(i)

__kernel void VectorAdd(__global const float* a,
                        __global const float* b,
                        __global float* c)
{
    int iGID = get_global_id(0);
    c[iGID] = a[iGID] + b[iGID];
}
28. OpenCL VectorAdd: Contexts and Queues
cl_context cxMainContext; // OpenCL context
cl_command_queue cqCommandQue; // OpenCL command queue
cl_device_id* cdDevices; // OpenCL device list
size_t szParmDataBytes; // byte length of parameter storage
// create the OpenCL context on a GPU device
cxMainContext = clCreateContextFromType (0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
// get the list of GPU devices associated with context
clGetContextInfo (cxMainContext, CL_CONTEXT_DEVICES, 0, NULL, &szParmDataBytes);
cdDevices = (cl_device_id*)malloc(szParmDataBytes);
clGetContextInfo (cxMainContext, CL_CONTEXT_DEVICES, szParmDataBytes, cdDevices, NULL);
// create a command-queue
cqCommandQue = clCreateCommandQueue (cxMainContext, cdDevices[0], 0, NULL);
29. OpenCL VectorAdd: Create Memory Objects,
Program and Kernel
Create Memory Objects
// allocate the first source buffer memory object… source data, so read only
…
// allocate the second source buffer memory object … source data, so read only
…
// allocate the destination buffer memory object … result data, so write only
…
Create Program and Kernel
// create the program
…
// build the program
…
// create the kernel
...
// set the kernel Argument values
…
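The slide elides the actual calls; a minimal sketch of what each step might look like in OpenCL 1.0 (cSourceCL and szKernelLength are assumed to hold the kernel source string and its length, iTestN is the vector length from the next slide; error handling omitted):

// allocate the two read-only source buffers and the write-only destination buffer
cl_mem srcA = clCreateBuffer(cxMainContext, CL_MEM_READ_ONLY, sizeof(cl_float) * iTestN, NULL, NULL);
cl_mem srcB = clCreateBuffer(cxMainContext, CL_MEM_READ_ONLY, sizeof(cl_float) * iTestN, NULL, NULL);
cl_mem dst  = clCreateBuffer(cxMainContext, CL_MEM_WRITE_ONLY, sizeof(cl_float) * iTestN, NULL, NULL);
// create and build the program, then create the kernel
cl_program cpProgram = clCreateProgramWithSource(cxMainContext, 1,
                           (const char**)&cSourceCL, &szKernelLength, NULL);
clBuildProgram(cpProgram, 0, NULL, NULL, NULL, NULL);
cl_kernel ckKernel = clCreateKernel(cpProgram, "VectorAdd", NULL);
// set the kernel argument values
clSetKernelArg(ckKernel, 0, sizeof(cl_mem), (void*)&srcA);
clSetKernelArg(ckKernel, 1, sizeof(cl_mem), (void*)&srcB);
clSetKernelArg(ckKernel, 2, sizeof(cl_mem), (void*)&dst);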
30. OpenCL VectorAdd: Launch Kernel
cl_command_queue cqCommandQue; // OpenCL command queue
cl_kernel ckKernel; // OpenCL kernel “VectorAdd”
size_t szGlobalWorkSize[1]; // global number of work-items
size_t szLocalWorkSize[1]; // number of work-items per work-group
cl_int ciErrNum; // OpenCL error code
int iTestN = 10000; // length of demo test vectors
// set work-item dimensions
szGlobalWorkSize[0] = iTestN;
szLocalWorkSize[0]= 1;
// execute kernel
ciErrNum = clEnqueueNDRangeKernel (cqCommandQue, ckKernel, 1, NULL,
szGlobalWorkSize, szLocalWorkSize,
0, NULL, NULL);
// Cleanup: release kernel, program, and memory objects
…
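For completeness, a hedged sketch of reading the result back and of the elided cleanup (pfResult is an assumed host array of iTestN floats; buffer and object names follow the sketch above):

// blocking read: copy the result buffer back to host memory
clEnqueueReadBuffer(cqCommandQue, dst, CL_TRUE, 0,
                    sizeof(cl_float) * iTestN, pfResult, 0, NULL, NULL);
// release kernel, program, memory objects, queue and context
clReleaseKernel(ckKernel);
clReleaseProgram(cpProgram);
clReleaseMemObject(srcA); clReleaseMemObject(srcB); clReleaseMemObject(dst);
clReleaseCommandQueue(cqCommandQue);
clReleaseContext(cxMainContext);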
31. Summary: OpenCL versus CUDA

                     OpenCL                               CUDA
Execution model      Work-groups / work-items             Blocks / threads
Memory model         Global / constant / local / private  Global / constant / shared / local, plus texture
Memory consistency   Weak consistency                     Weak consistency
Synchronization      Work-group barrier                   __syncthreads()
                     (between work-items)                 (between threads)
32. Conclusions
OpenCL meets the trends in HPC
Supports multicore SMPs
Supports accelerators (in particular GPGPUs)
Allows programming of heterogeneous processors
Supports vectorization (SIMD operations)
Explicitly supports parallel image processing
Suitable for massive parallelism: OpenCL + MPI
OpenCL is a low-level parallel language and complicated to use.
If you master it, you master (heterogeneous) parallel programming.
It might become the basis for new high-level parallel languages.
34. Next TechTalk on CUDA: Wednesday, July 22, 2009
(15:00-16:00, Room KP-2b-06, Funktional)
CUDA - The Compute Unified Device Architecture
from NVIDIA
Jens Rühmkorf