Adapting Languages for Parallel Processing on GPUs
Neil Henning – Technology Lead

Neil Henning
neil@codeplay.com
Agenda

● Introduction
● Current landscape
● What is wrong with the current landscape
● How to enable your language on GPUs
● Developing tools for GPUs
Introduction

Introduction – who am I?

● Five years in the industry
● Spent all of that using SPUs, GPUs, vector units & DSPs
● Last two years focused on open standards (mostly OpenCL)
● Passionate about making compute easy
Introduction – who are we?

● GPU Compiler Experts based out of Edinburgh, Scotland
● 35 employees working on contracts, R&D and internal tech
Current Landscape

Current Landscape

● Languages – CUDA, RenderScript, C++AMP & OpenCL
● Targets – GPU (mobile & desktop), CPU (scalar & vector), DSPs, FPGAs
● Concerns – performance, power, precision, parallelism & portability
Current Landscape - CUDA

__global__ void kernel(char * a, char * b)
{
  a[blockIdx.x] = b[blockIdx.x];
}

char in[SIZE], out[SIZE];
char * cIn, * cOut;
cudaMalloc((void **)&cIn, SIZE);
cudaMalloc((void **)&cOut, SIZE);
cudaMemcpy(cIn, in, SIZE, cudaMemcpyHostToDevice);
kernel<<<SIZE, 1>>>(cOut, cIn);
cudaMemcpy(out, cOut, SIZE, cudaMemcpyDeviceToHost);
cudaFree(cIn);
cudaFree(cOut);

● CUDA incredibly established
● First major GPU compute approach to market
● Huge bank of tools, libraries and knowledge
● Used in banking, medical imaging, game asset creation, and many many more uses!
● Really only had uptake in offline processing
● Standard isn't open, little room (or enthusiasm) for other vendors to implement
● Using CUDA means abandoning compute on the majority of devices
Current Landscape - RenderScript
#pragma version(1)
#pragma rs java_package_name(foo)
rs_allocation gIn; rs_allocation gOut;
rs_script gScript;
void root(const char * in, char * out,
  const void * usr, uint32_t x, uint32_t y) {
  *out = *in;
}
void filter() {
  rsForEach(gScript, gIn, gOut, NULL);
}

Context ctxt = /* … */;
RenderScript rs = RenderScript.create(ctxt);
ScriptC_foo script = new ScriptC_foo(rs, getResources(), R.raw.foo);
Allocation in = Allocation.createSized(rs, Element.I8(rs), SIZE);
Allocation out = Allocation.createSized(rs, Element.I8(rs), SIZE);
script.set_gIn(in); script.set_gOut(out);
script.set_gScript(script);
script.invoke_filter();

● Intelligent runtime load balances kernels
● Only on Android
● Creates Java classes to interface with kernels
● Limited documentation & shortage of examples
● Focused on performance portability
● No real idea of feature roadmap
Current Landscape – C++AMP

int in[SIZE], out[SIZE];
array_view<const int, 1> aIn(SIZE, in);
array_view<int, 1> aOut(SIZE, out);
aOut.discard_data();
parallel_for_each(aOut.extent,
  [=](index<1> idx) restrict(amp)
  {
    aOut[idx] = aIn[idx];
  }
);
// can access aOut[…] like normal

● Very well thought out single source approach
● Lovely use of C++ templates to capture type information, array dimensions
● Great use of C++11 lambdas for capturing kernel intent
● Part of target community is really C++11 averse, need convincing
● Limited low-level support
● Initial interest by community faded fast
● Xbox One will support C++AMP – watch this space
Current Landscape - OpenCL

void kernel foo(global int * a, global int * b)
{
  int idx = get_global_id(0);
  a[idx] = b[idx];
}

// device, context, queue, in, out already created
cl_program program = clCreateProgramWithSource(context, 1, fooAsStr, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "foo", NULL);
// set kernel arguments
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL, 0, NULL, NULL);

● Open standard with many contributors
● API is verbose, very very verbose!
● API puts control in developer hands
● Steep learning curve for new developers
● Support on lots of heterogeneous platforms – not just GPUs!
● Have to support diverse range of application types
Current Landscape

Modern systems have many compute-capable devices in them

Not unlike the fictitious system shown above!
Current Landscape
Scalar CPUs are the 'normal' target for programmers, easy to target, easy to use. Mostly a fallback target for compute currently.

Vector units are supported if the kernel has vector types. Can auto-vectorize user kernels, as vector units are harder for 'normal' programmers to target.

Digital Signal Processors (DSPs) are a future target for the compute market. Can make no assumptions as to what DSPs 'look' like.

GPUs are the reason we have compute in the first place. They do not forgive poor code like a CPU or even a DSP could, and require large arrays of work to utilize.
Current Landscape

● Have to weigh up many competing concerns for languages
● Platform, operating system, device type, battery life, use case
What is wrong with the current landscape

What is wrong with the current landscape

● Compute approaches are not on all device and OS combinations
● No CUDA on AMD, RenderScript on iOS or C++AMP on Linux
● Have to support offline precise compute & time-bound online compute
● Very divergent targets/use cases/device types are problematic!
What is wrong with the current landscape

● What if the loop count is always a multiple of four?

void foo(int * a, int * b, int * count)
{
  for(int idx = 0; idx < *(count); ++idx)
  {
    a[idx] = 42 * b[idx];
  }
}
What is wrong with the current landscape

● What if the loop count is always a multiple of four?
● Can unroll the loop four times!

void foo(int * a, int * b, int * count)
{
  for(int idx = 0; idx < *(count); idx += 4)
  {
    a[idx + 0] = 42 * b[idx + 0];
    a[idx + 1] = 42 * b[idx + 1];
    a[idx + 2] = 42 * b[idx + 2];
    a[idx + 3] = 42 * b[idx + 3];
  }
}
What is wrong with the current landscape

● What if the loop count is always a multiple of four?
● Can unroll the loop four times!
● What if pointers a & b are sixteen byte aligned?
● Can vectorize the loop body!

void foo(int * a, int * b, int * count)
{
  int vecCount = *(count) / 4;
  int4 * vA = (int4 * )a;
  int4 * vB = (int4 * )b;
  for(int idx = 0; idx < vecCount; ++idx)
  {
    vA[idx] = vB[idx] * (int4 )42;
  }
}
What is wrong with the current landscape

● Why does my code look so radically different now?
● Current languages force drastic developer interventions
What is wrong with the current landscape

● Existing languages (mostly) force developers to do coding wizardry that is unnecessary
● Also no real feedback to the developer, as the 'main' compute targets have highly secretive ISAs
● Don't want to force vendors to reveal secrets, but do want the ability to influence kernel code generation
What is wrong with the current landscape

● Rely on vendors to provide tools to aid development
● Debuggers, profilers, static analysis all increasingly required
● Libraries can vastly decrease development time
● Rely solely on vendors to provide all these complicated pieces
What is wrong with the current landscape

● Vendors already have lots of targets to support
● Every generation of devices needs to test conformance
● Need to support compilers, graphics, compute, tools, the list goes on!
● Why should the vendor be the only one taking the burden?
What is wrong with the current landscape

● No one can agree on what is the 'best' approach
● Personal preference of developer/organization sways opinions
● Why not allow Lisp on a GPU? Lua on a DSP?
● Vendor doesn't need the extra headache of supporting these niche use cases
What is wrong with the current landscape

● My pitch – let the community support compute standards
● Take the approach of LLVM & Clang
● Vendor has to support the lower-level standard on their hardware
● But this allows the community to support & innovate
How to enable your language on GPUs

How to enable your language on GPUs

● First step – be able to compile language to a binary
● Can't output real binary though
● Vendor doesn't want to expose ISA
● Developer wants portability of compiled kernels
How to enable your language on GPUs

● Need to use an Intermediate Representation (IR)
● Two approaches in development for this!
● HSA Intermediate Language (HSAIL)
● OpenCL Standard Portable Intermediate Representation (SPIR)
How to enable your language on GPUs

HSAIL:
● Language -> LLVM IR -> HSAIL
● Low level mapping onto hardware, more of a virtual ISA than an IR
● HSAIL heavily in development

OpenCL SPIR:
● Language -> LLVM IR -> SPIR
● Then pass SPIR to OpenCL runtime as binary
● Execute like normal OpenCL C Language kernel
● Provisional specification available!
How to enable your language on GPUs

HSAIL:
● HSA will provide a low-level runtime to interface between HSA compiled binaries and the OS
● HSAIL is being standardized and ratified
● Existing JIT'ed languages potential targets

OpenCL SPIR:
● OpenCL SPIR will require a SPIR compliant OpenCL implementation as target
● Can compile using LLVM, then use clCreateProgramWithBinary, passing SPIR options
How to enable your language on GPUs

● At present, SPIR is the only target we can investigate
● Intel has OpenCL drivers with provisional SPIR support
● Can use Clang -> LLVM -> SPIR, then use Intel's OpenCL to consume SPIR
● Can take code that compiles to LLVM and run it on OpenCL
How to enable your language on GPUs

● Various steps to getting your language working on GPUs with SPIR
● We'll use Intel's OpenCL SDK with provisional SPIR support:
1. Create a test harness to load a SPIR binary
2. Create a simple kernel using Intel's SPIR compiler on host
3. Create a simple kernel using tip Clang (language OpenCL) targeting SPIR
4. Try other languages that compile to LLVM with SPIR target
How to enable your language on GPUs

// some SPIR bitcode file (spir_bc_length is its size in bytes)
const unsigned char spir_bc[spir_bc_length];
// already initialized platform, device & context for a SPIR compliant device
cl_platform_id platform = ... ;
cl_device_id device = ... ;
cl_context context = ... ;
// create our program with our SPIR bitcode file
const size_t length = spir_bc_length;
const unsigned char * binaries[] = { spir_bc };
cl_program program = clCreateProgramWithBinary(
  context, 1, &device, &length, binaries, NULL, NULL);
// build, passing arguments telling the compiler the language is SPIR, and the SPIR standard we are using
clBuildProgram(program, 1, &device, "-x spir -spir-std=1.2", NULL, NULL);
How to enable your language on GPUs

// already initialized memory buffers for our context
cl_mem in_mem = ... ;
cl_mem out_mem = ... ;
// assume our kernel function from the SPIR kernel was called foo
cl_kernel kernel = clCreateKernel(program, "foo", NULL);
// assume our kernel has one read buffer as first argument, and one write buffer as second
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void * )&in_mem);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void * )&out_mem);
How to enable your language on GPUs

// already initialized command queue
cl_command_queue queue = ... ;
cl_event write_event, run_event;
clEnqueueWriteBuffer(queue, in_mem, CL_FALSE, 0, BUFFER_SIZE,
  &read_payload, 0, NULL, &write_event);
const size_t size = BUFFER_SIZE / sizeof(cl_int);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL, 1, &write_event, &run_event);
clEnqueueReadBuffer(queue, out_mem, CL_TRUE, 0, BUFFER_SIZE,
  &result_payload, 1, &run_event, NULL);
How to enable your language on GPUs

● Now, create a simple OpenCL kernel

void kernel foo(global int * in, global int * out)
{
  out[get_global_id(0)] = in[get_global_id(0)];
}

● And use Intel's command line (or GUI!) tool to build

ioc32 -cmd=build -input foo.cl -spir32=foo.bc
How to enable your language on GPUs

● Next we point the buffer for our SPIR kernel at the generated SPIR kernel
● And it fails…?
● Turns out Intel's OpenCL runtime doesn't like us telling them they are building SPIR!
● Simply remove "-x spir -spir-std=1.2" from the build options and voila! (see below)
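A quick sketch of what that looks like in the earlier harness (same program and device variables as before; the empty string simply omits the SPIR options):

clBuildProgram(program, 1, &device, "", NULL, NULL); // no "-x spir -spir-std=1.2", the runtime accepts the SPIR binary anyway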
How to enable your language on GPUs

● Next step – use tip Clang to build our foo.cl kernel

clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc foo.cl -o foo.bc

● Compiles ok, but when we run it fails…?
● So a Clang-generated SPIR bitcode file could very well not work
● We'll take a look at the readable IR for the Intel & Clang compiled kernels
How to enable your language on GPUs
● Clang Output

; ModuleID = 'ex.cl'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
target triple = "spir-unknown-unknown"
; Function Attrs: nounwind
define void @foo(i32 addrspace(1)* nocapture readonly %a, i32 addrspace(1)* nocapture %b) #0 {
entry:
  %0 = load i32 addrspace(1)* %a, align 4, !tbaa !2
  store i32 %0, i32 addrspace(1)* %b, align 4, !tbaa !2
  ret void
}

attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

!opencl.kernels = !{!0}
!llvm.ident = !{!1}
!0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo}
!1 = metadata !{metadata !"clang version 3.4 (trunk)"}
!2 = metadata !{metadata !3, metadata !3, i64 0}
!3 = metadata !{metadata !"int", metadata !4, i64 0}
!4 = metadata !{metadata !"omnipotent char", metadata !5, i64 0}
!5 = metadata !{metadata !"Simple C/C++ TBAA"}
How to enable your language on GPUs
● IOC Output

; ModuleID = 'ex.bc'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
target triple = "spir-unknown-unknown"
define spir_kernel void @foo(i32 addrspace(1)* %a, i32 addrspace(1)* %b) nounwind {
  %1 = alloca i32 addrspace(1)*, align 4
  %2 = alloca i32 addrspace(1)*, align 4
  store i32 addrspace(1)* %a, i32 addrspace(1)** %1, align 4
  store i32 addrspace(1)* %b, i32 addrspace(1)** %2, align 4
  %3 = load i32 addrspace(1)** %1, align 4
  %4 = load i32 addrspace(1)* %3, align 4
  %5 = load i32 addrspace(1)** %2, align 4
  store i32 %4, i32 addrspace(1)* %5, align 4
  ret void
}

!opencl.kernels = !{!0}
!opencl.enable.FP_CONTRACT = !{}
!opencl.spir.version = !{!6}
!opencl.ocl.version = !{!7}
!opencl.used.extensions = !{!8}
!opencl.used.optional.core.features = !{!8}
!opencl.compiler.options = !{!8}
!0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5}
!1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1}
!2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"}
!3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"}
!4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""}
!5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"}
!6 = metadata !{i32 1, i32 0}
!7 = metadata !{i32 0, i32 0}
!8 = metadata !{}
How to enable your language on GPUs

● So the metadata is different!
● We could fix Clang to produce the right metadata…?
● Or just hack around!
● Let's use Intel's compiler to generate a stub function
● Then we can use an extern function defined in our Clang module!
How to enable your language on GPUs

// stub kernel, built with Intel's SPIR compiler
extern int doSomething(int a);
void kernel foo(global int * in, global int * out)
{
  int id = get_global_id(0);
  out[id] = doSomething(in[id]);
}

// the actual implementation, built with tip Clang
int doSomething(int a)
{
  return a;
}
How to enable your language on GPUs

● And it fails…?
● Intel's compiler doesn't like extern functions!
● We've already bodged it thus far…
● So let's continue!

// weak stub, replaced by the Clang-compiled definition at link time
int __attribute__((weak)) doSomething(int a) {}
void kernel foo(global int * in, global int * out)
{
  int id = get_global_id(0);
  out[id] = doSomething(in[id]);
}
How to enable your language on GPUs

● More than a little nasty…
● Relies on a Clang extension to declare the function weak within OpenCL
● Relies on Intel using Clang and allowing the extension
● But it works!
● Can build both the Intel stub code & the Clang actual code
● Then use llvm-link to pull them together! (sketched below)
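A sketch of the full build-and-link sequence, assuming the stub kernel lives in foo.cl and the Clang-compiled implementation of doSomething in a separate impl.cl (file names are illustrative):

ioc32 -cmd=build -input foo.cl -spir32=foo_stub.bc
clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc impl.cl -o impl.bc
llvm-link foo_stub.bc impl.bc -o foo.bc

The linked foo.bc is then what the test harness loads through clCreateProgramWithBinary.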
How to enable your language on GPUs

● So now we can compile two OpenCL kernels, link them together, and run it
● What is next? Want to enable your language!
● What about using Clang, but using a different language?
● C & C++ come to mind!
How to enable your language on GPUs

● Use a simple C file

int doSomething(int a)
{
  return a;
}

● And use Clang to compile it

clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc foo.c -o foo.bc
How to enable your language on GPUs

● Or a simple C++ file!

extern "C" int doSomething(int a);
template<typename T> T templatedSomething(const T t)
{
  return t;
}
int doSomething(int a)
{
  return templatedSomething(a);
}
How to enable your language on GPUs

● Let's have some real C++ code
● Use features that OpenCL doesn't provide us
● We'll do a matrix multiplication in C++
● Use classes, constructors, templates
How to enable your language on GPUs

typedef float __attribute__((ext_vector_type(4))) float4;
typedef float __attribute__((ext_vector_type(16))) float16;
float __attribute__((overloadable)) dot(float4 a, float4 b);
template<typename T, unsigned int WIDTH, unsigned int HEIGHT> class Matrix
{
typedef T __attribute__((ext_vector_type(WIDTH))) RowType;
RowType rows[HEIGHT];
public:
Matrix() {}
template<typename U> Matrix(const U & u) { __builtin_memcpy(&rows, &u, sizeof(U)); }
RowType & operator[](const unsigned int index) { return rows[index]; }
const RowType & operator[](const unsigned int index) const { return rows[index]; }
};

How to enable your language on GPUs

template<typename T, unsigned int WIDTH, unsigned int HEIGHT>
Matrix<T, WIDTH, HEIGHT> operator *(const Matrix<T, WIDTH, HEIGHT> & a, const Matrix<T, WIDTH, HEIGHT> & b)
{
  Matrix<T, HEIGHT, WIDTH> bShuffled;
  for(unsigned int h = 0; h < HEIGHT; h++)
    for(unsigned int w = 0; w < WIDTH; w++)
      bShuffled[w][h] = b[h][w];
  Matrix<T, WIDTH, HEIGHT> result;
  for(unsigned int h = 0; h < HEIGHT; h++)
    for(unsigned int w = 0; w < WIDTH; w++)
      result[h][w] = dot(a[h], bShuffled[w]);
  return result;
}
How to enable your language on GPUs

extern "C" float16 doSomething(float16 a, float16 b);
float16 doSomething(float16 a, float16 b)
{
  Matrix<float, 4, 4> matA(a);
  Matrix<float, 4, 4> matB(b);
  Matrix<float, 4, 4> mul = matA * matB;
  float16 result = (float16 )0;
  result.s0123 = mul[0];
  result.s4567 = mul[1];
  result.s89ab = mul[2];
  result.scdef = mul[3];
  return result;
}
How to enable your language on GPUs

● And when we run it…

ex5.vcxproj -> E:AMDDeveloperSummit2013buildExample5Debugex5.exe
Found 2 platforms!
Choosing vendor 'Intel(R) Corporation'!
Found 1 devices!
SPIR file length '3948' bytes!
[ 0.0, 1.0, 2.0, 3.0] * [ 16.0, 15.0, 14.0, 13.0] = [ 40.0, 34.0, 28.0, 22.0]
[ 4.0, 5.0, 6.0, 7.0] * [ 12.0, 11.0, 10.0, 9.0] = [200.0, 178.0, 156.0, 134.0]
[ 8.0, 9.0, 10.0, 11.0] * [ 8.0, 7.0, 6.0, 5.0] = [360.0, 322.0, 284.0, 246.0]
[ 12.0, 13.0, 14.0, 15.0] * [ 4.0, 3.0, 2.0, 1.0] = [520.0, 466.0, 412.0, 358.0]

● Success!
How to enable your language on GPUs

● The least you need to target a GPU:
● Generate correct LLVM IR with SPIR metadata
● Or at least generate LLVM IR and use the approach we used to combine Clang and IOC generated kernels

!opencl.kernels = !{!0}
!opencl.enable.FP_CONTRACT = !{}
!opencl.spir.version = !{!6}
!opencl.ocl.version = !{!7}
!opencl.used.extensions = !{!8}
!opencl.used.optional.core.features = !{!8}
!opencl.compiler.options = !{!8}
!0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5}
!1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1}
!2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"}
!3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"}
!4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""}
!5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"}
!6 = metadata !{i32 1, i32 0}
!7 = metadata !{i32 0, i32 0}
!8 = metadata !{}
How to enable your language on GPUs

● Porting C/C++ libraries to SPIR requires a little more work

int foo(int * a)
{
  return *a;
}

● The data pointed to by 'a' will by default be put in the private address space
● But a straight conversion to SPIR needs all data in the global address space
● Means that any porting of existing code could be quite intrusive (illustrated below)
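As a rough illustration of how intrusive the port can be, the same function rewritten for SPIR needs an explicit address-space qualifier on the pointer (a sketch using the OpenCL C qualifier; the exact spelling depends on the front end):

// every pointer to buffer data now names the global address space
int foo(global int * a)
{
  return *a;
}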
How to enable your language on GPUs

● To target your language at GPUs:
  ● Need to be able to segregate work into parallel chunks
  ● Have to ban certain features that don't work with compute
  ● Need to deal with distinct address spaces
● Language could also provide an API onto OpenCL SPIR builtins (sketched below)
● But with OpenCL SPIR it is now possible to make any language work on a GPU!
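One way a language could expose an API onto the OpenCL SPIR builtins, sketched in the same style as the dot() declaration in the matrix example. The builtin declarations follow the OpenCL C signatures; WorkItem is an invented wrapper, and whether the mangled names resolve against a given SPIR consumer would need checking:

typedef unsigned int uint;
// declare the work-item builtins we want the SPIR consumer to resolve
// (on a 32-bit SPIR target size_t is 32-bit, so uint keeps this self-contained)
uint get_global_id(uint dimension);
uint get_global_size(uint dimension);

// the hosting language's runtime can then wrap them however it likes
struct WorkItem
{
  static uint id(uint dim)    { return get_global_id(dim); }
  static uint count(uint dim) { return get_global_size(dim); }
};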
Developing tools for GPUs

Developing tools for GPUs

● Tools increasingly required to support development
● Even having printf (which OpenCL 1.2 added) is novel! (example below)
● But with increasingly complex code better tools are needed
● Main three are debuggers, profilers and compiler-tools
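For example, OpenCL 1.2's printf is already enough for basic tracing inside a kernel (a trivial sketch; the kernel name and message are illustrative):

void kernel trace(global int * in)
{
  // printf is a core builtin from OpenCL 1.2 onwards; output is gathered by the runtime
  printf("work-item %d saw %d\n", (int)get_global_id(0), in[get_global_id(0)]);
}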
Developing tools for GPUs

● Debuggers for compute are difficult for a non-vendor to develop
● Codeplay has developed such tools on top of compute standards
● Problem is the bedrock for these tools can change at any time
● Hard to beat a vendor-owned approach that has lower-level access
Developing tools for GPUs

● Codeplay are pushing hard for HSA to have features that aid tool development
● Debuggers are much easier with instruction support, debug info, change registers, call stacks
● OpenCL SPIR is harder to create a debugger for without vendor support
● Can we standardize a way to debug OpenCL SPIR, or allow debugging via emulation of SPIR?
Developing tools for GPUs

● Profilers require superset of debugger feature-set
● Need to be able to trap kernels at defined points
● Accurate timings only other requirement beyond debugger support
● More fun when we go beyond performance, and measure power
Developing tools for GPUs

● HSA and OpenCL SPIR both good profiler targets
● Could split SPIR kernels into profiling sections
● Then use existing timing information in OpenCL (see the sketch below)
● HSA will only require the debugger features we are pushing for
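A minimal sketch of the "existing timing information" idea, assuming the command queue from the earlier harness was created with CL_QUEUE_PROFILING_ENABLE and run_event is the event returned for the kernel:

// query the device timestamps the runtime recorded for the kernel command
cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(run_event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(run_event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
// both timestamps are in nanoseconds
printf("kernel took %llu ns\n", (unsigned long long)(end - start));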
Developing tools for GPUs

● Compiler tools consist of optimizers and analysis
● Both HSA and OpenCL SPIR being based on LLVM enables this!
● We as compiler experts can aid existing runtimes
● You as developers can add optimizations & analyse your kernels! (a trivial example pass follows below)
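As a taste of what that can look like, here is a minimal, hypothetical LLVM analysis pass (legacy pass manager, LLVM 3.x era) that just counts the instructions in each function of a bitcode module; it is a sketch of the kind of tool a developer could load into opt and point at SPIR bitcode, not anything shipped with this deck:

#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

namespace {
// trivial analysis: print the instruction count of every function we visit
struct KernelStats : public FunctionPass {
  static char ID;
  KernelStats() : FunctionPass(ID) {}
  virtual bool runOnFunction(Function &F) {
    unsigned instructions = 0;
    for (Function::iterator BB = F.begin(), E = F.end(); BB != E; ++BB)
      instructions += BB->size();
    errs() << F.getName() << ": " << instructions << " instructions\n";
    return false; // analysis only, the IR is left untouched
  }
};
}

char KernelStats::ID = 0;
static RegisterPass<KernelStats> X("kernel-stats", "Count instructions per kernel");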
Conclusion

Conclusion

● With the rise of open standards, compute is increasingly easy
● With HSA & OpenCL SPIR hardware is finally open to us!
● Just need standards to ratify, mature & be available on hardware!
● Next big push into compute is upon us
Questions?
Can also catch me on twitter @sheredom

Neil Henning
neil@codeplay.com
Resources

● SPIR extension on Khronos website
  http://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/cl_khr_spir.html
● SPIR provisional specification
  http://www.khronos.org/files/opencl-spir-12-provisional.pdf
● HSA Foundation
  http://hsafoundation.com/

 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Dernier (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

  • 11. Current Landscape - OpenCL

    void kernel foo(global int * a, global int * b)
    {
      int idx = get_global_id(0);
      a[idx] = b[idx];
    }

    // device, context, queue, in, out already created
    cl_program program = clCreateProgramWithSource(context, 1, &fooAsStr, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "foo", NULL);
    // set kernel arguments
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL, 0, NULL, NULL);

    ● Open standard with many contributors
    ● API is verbose, very very verbose!
    ● API puts control in developer hands
    ● Steep learning curve for new developers
    ● Support on lots of heterogeneous platforms – not just GPUs!
    ● Have to support diverse range of application types
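    To make the verbosity point concrete: even before the snippet above runs, the host has to set up a platform, device, context, queue and buffers. A minimal sketch of that boilerplate, with names and the omitted error handling being illustrative rather than taken from the talk:

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // no properties, no callbacks - the bare minimum context and queue
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

    // buffers backing the kernel's two pointer arguments
    cl_mem in = clCreateBuffer(context, CL_MEM_READ_ONLY, SIZE, NULL, NULL);
    cl_mem out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, SIZE, NULL, NULL);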
  • 12. Current Landscape
    Modern systems have many compute-capable devices in them
    Not unlike the fictitious system shown on the slide!

  • 13. Current Landscape (the diagram builds up over the next slides; only the newly added captions are shown here)
    Scalar CPUs are the ‘normal’ target for programmers, easy to target, easy to use
    Mostly a fallback target for compute currently

  • 14. Current Landscape
    Vector units are supported if kernel has vector types
    Can auto-vectorize user kernels, as vector units are harder for ‘normal’ programmers to target

  • 15. Current Landscape
    Digital Signal Processors (DSPs) are a future target for the compute market
    Can make no assumptions as to what DSPs ‘look’ like

  • 16. Current Landscape
    GPUs are the reason we have compute in the first place
    GPUs do not forgive poor code like a CPU or even a DSP could, require large arrays of work to utilize
  • 17. Current Landscape
    ● Have to weigh up many competing concerns for languages
    ● Platform, operating system, device type, battery life, use case
  • 18. What is wrong with the current landscape
  • 19. What is wrong with the current landscape
    ● Compute approaches are not on all device and OS combinations
    ● No CUDA on AMD, RenderScript on iOS or C++AMP on Linux
    ● Have to support offline precise compute & time-bound online compute
    ● Very divergent targets/use cases/device types are problematic!
  • 20. What is wrong with the current landscape
    ● What if loop count is always a multiple of four?

    void foo(int * a, int * b, int * count)
    {
      for(int idx = 0; idx < *(count); ++idx)
      {
        a[idx] = 42 * b[idx];
      }
    }

  • 21. What is wrong with the current landscape
    ● What if loop count is always a multiple of four?
    ● Can unroll the loop four times!

    void foo(int * a, int * b, int * count)
    {
      for(int idx = 0; idx < *(count); idx += 4)
      {
        a[idx + 0] = 42 * b[idx + 0];
        a[idx + 1] = 42 * b[idx + 1];
        a[idx + 2] = 42 * b[idx + 2];
        a[idx + 3] = 42 * b[idx + 3];
      }
    }

  • 22. What is wrong with the current landscape (same unrolled code as slide 21)
    ● What if pointers a & b are sixteen byte aligned?

  • 23. What is wrong with the current landscape
    ● Can vectorize the loop body!

    void foo(int * a, int * b, int * count)
    {
      int vecCount = *(count) / 4;
      int4 * vA = (int4 * )a;
      int4 * vB = (int4 * )b;

      for(int idx = 0; idx < vecCount; ++idx)
      {
        vA[idx] = vB[idx] * (int4 )42;
      }
    }

  • 24. What is wrong with the current landscape (same vectorized code as slide 23)
    ● Why does my code look so radically different now?

  • 25. What is wrong with the current landscape (same vectorized code as slide 23)
    ● Current languages force drastic developer interventions
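    For contrast with the hand-unrolled and hand-vectorized variants above, the same multiply can be expressed as a data-parallel OpenCL kernel, leaving unrolling and vectorization to the device compiler. A minimal sketch, not taken from the slides:

    void kernel foo(global int * a, global int * b)
    {
      // one work-item per element; the runtime supplies the iteration space
      int idx = get_global_id(0);
      a[idx] = 42 * b[idx];
    }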
  • 26. What is wrong with the current landscape (the vectorized foo from slides 23–25 is shown again on this slide)
    ● Existing languages (mostly) force developers to do coding wizardry that is unnecessary
    ● Also no real feedback to developer as ‘main’ compute target has highly secretive ISAs
    ● Don’t want to force vendors to reveal secrets, but do want ability to influence kernel code generation
  • 27. What is wrong with the current landscape
    ● Rely on vendors to provide tools to aid development
    ● Debuggers, profilers, static analysis all increasingly required
    ● Libraries can vastly decrease development time
    ● Rely solely on vendors to provide all these complicated pieces

  • 28. What is wrong with the current landscape
    ● Vendors already have lots of targets to support
    ● Every generation of devices needs to test conformance
    ● Need to support compilers, graphics, compute, tools, the list goes on!
    ● Why should the vendor be the only one taking the burden?

  • 29. What is wrong with the current landscape
    ● No one can agree on what is the ‘best’ approach
    ● Personal preference of developer/organization sways opinions
    ● Why not allow Lisp on a GPU? Lua on a DSP?
    ● Vendor doesn’t need the extra headache of supporting these niche use cases

  • 30. What is wrong with the current landscape
    ● My pitch – let the community support compute standards
    ● Take the approach of LLVM & Clang
    ● Vendor has to support the lower standard on their hardware
    ● But allows the community to support & innovate
  • 31. How to enable your language on GPUs
  • 32. How to enable your language on GPUs
    ● First step – be able to compile language to a binary
    ● Can’t output real binary though
    ● Vendor doesn’t want to expose ISA
    ● Developer wants portability of compiled kernels

  • 33. How to enable your language on GPUs
    ● Need to use an Intermediate Representation (IR)
    ● Two approaches in development for this!
    ● HSA Intermediate Language (HSAIL)
    ● OpenCL Standard Portable Intermediate Representation (SPIR)
  • 34. How to enable your language on GPUs
    (HSAIL route – “Our Language”)
    ● Language -> LLVM IR -> HSAIL
    ● Low level mapping onto hardware, more of a virtual ISA than an IR
    ● HSAIL heavily in development
    (SPIR route – “Our Language”)
    ● Language -> LLVM IR -> SPIR
    ● Then pass SPIR to OpenCL runtime as binary
    ● Execute like normal OpenCL C language kernel
    ● Provisional specification available!
  • 35. How to enable your language on GPUs
    (HSAIL route – “Our Language”)
    ● HSA will provide a low-level runtime to interface between HSA compiled binaries and OS
    ● HSAIL is being standardized and ratified
    ● Existing JIT’ed languages potential targets
    (SPIR route – “Our Language”)
    ● OpenCL SPIR will require a SPIR compliant OpenCL implementation as target
    ● Can compile using LLVM, then use clCreateProgramWithBinary, passing SPIR options
  • 36. How to enable your language on GPUs
    ● At present, SPIR is the only target we can investigate
    ● Intel has OpenCL drivers with provisional SPIR support
    ● Can use Clang -> LLVM -> SPIR, then use Intel’s OpenCL to consume SPIR
    ● Can take code that compiles to LLVM and run it on OpenCL

  • 37. How to enable your language on GPUs
    ● Various steps to getting your language working on GPUs with SPIR
    ● We’ll use Intel’s OpenCL SDK with provisional SPIR support;
      1. Create a test harness to load a SPIR binary
      2. Create a simple kernel using Intel’s SPIR compiler on host
      3. Create a simple kernel using tip Clang (language OpenCL) targeting SPIR
      4. Try other languages that compile to LLVM with SPIR target
  • 38. How to enable your language on GPUs

    // some SPIR bitcode file
    const unsigned char spir_bc[spir_bc_length];

    // already initialized platform, device & context for a SPIR compliant device
    cl_platform_id platform = ... ;
    cl_device_id device = ... ;
    cl_context context = ... ;

    // create our program with our SPIR bitcode file
    cl_program program = clCreateProgramWithBinary(
      context, 1, &device, &spir_bc_length, &spir_bc, NULL, NULL);

    // build, passing arguments telling the compiler the language is SPIR,
    // and the SPIR standard we are using
    clBuildProgram(program, 1, &device, "-x spir -spir-std=1.2", NULL, NULL);
  • 39. How to enable your language on GPUs

    // already initialized memory buffers for our context
    cl_mem in_mem = ... ;
    cl_mem out_mem = ... ;

    // assume our kernel function from the SPIR kernel was called foo
    cl_kernel kernel = clCreateKernel(program, "foo", NULL);

    // assume our kernel has one read buffer as first argument, and one write buffer as second
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void * )&in_mem);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void * )&out_mem);
  • 40. How to enable your language on GPUs

    // already initialized command queue
    cl_command_queue queue = ... ;

    cl_event write_event, run_event;

    clEnqueueWriteBuffer(queue, in_mem, CL_FALSE, 0, BUFFER_SIZE,
      &read_payload, 0, NULL, &write_event);

    const size_t size = BUFFER_SIZE / sizeof(cl_int);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL,
      1, &write_event, &run_event);

    clEnqueueReadBuffer(queue, out_mem, CL_TRUE, 0, BUFFER_SIZE,
      &result_payload, 1, &run_event, NULL);
  • 41. How to enable your language on GPUs
    ● Now, create a simple OpenCL kernel

    void kernel foo(global int * in, global int * out)
    {
      out[get_global_id(0)] = in[get_global_id(0)];
    }

    ● And use Intel’s command line (or GUI!) tool to build

    ioc32 -cmd=build -input foo.cl -spir32=foo.bc
  • 42. How to enable your language on GPUs
    ● Next we point the buffer for our SPIR kernel at the generated SPIR kernel
    ● And it fails…?
    ● Turns out Intel’s OpenCL runtime doesn’t like us telling them they are building SPIR!
    ● Simply remove "-x spir -spir-std=1.2" from the build options and voila!
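    In other words, the workaround described above amounts to building the SPIR binary with empty options. A sketch of the adjusted call from slide 38, with the empty option string being the only change:

    // workaround: let the driver recognise the SPIR binary itself, pass no SPIR flags
    clBuildProgram(program, 1, &device, "", NULL, NULL);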
  • 43. How to enable your language on GPUs
    ● Next step – use tip Clang to build our foo.cl kernel

    clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc foo.cl -o foo.bc

    ● Compiles ok, but when we run it fails…?
    ● So a Clang-generated SPIR bitcode file could very well not work
    ● We’ll take a look at the readable IR for the Intel & Clang compiled kernels
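    One way to obtain the human-readable IR compared on the next two slides is LLVM’s llvm-dis tool; a sketch with illustrative file names:

    llvm-dis foo.bc -o foo.ll            # disassemble the Clang-produced bitcode
    llvm-dis foo_ioc.bc -o foo_ioc.ll    # disassemble the ioc32-produced bitcode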
  • 44. How to enable your language on GPUs
    ● Clang Output

    ; ModuleID = 'ex.cl'
    target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
    target triple = "spir-unknown-unknown"

    ; Function Attrs: nounwind
    define void @foo(i32 addrspace(1)* nocapture readonly %a, i32 addrspace(1)* nocapture %b) #0 {
    entry:
      %0 = load i32 addrspace(1)* %a, align 4, !tbaa !2
      store i32 %0, i32 addrspace(1)* %b, align 4, !tbaa !2
      ret void
    }

    attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

    !opencl.kernels = !{!0}
    !llvm.ident = !{!1}

    !0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo}
    !1 = metadata !{metadata !"clang version 3.4 (trunk)"}
    !2 = metadata !{metadata !3, metadata !3, i64 0}
    !3 = metadata !{metadata !"int", metadata !4, i64 0}
    !4 = metadata !{metadata !"omnipotent char", metadata !5, i64 0}
    !5 = metadata !{metadata !"Simple C/C++ TBAA"}
  • 45. How to enable your language on GPUs
    ● IOC Output

    ; ModuleID = 'ex.bc'
    target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
    target triple = "spir-unknown-unknown"

    define spir_kernel void @foo(i32 addrspace(1)* %a, i32 addrspace(1)* %b) nounwind {
      %1 = alloca i32 addrspace(1)*, align 4
      %2 = alloca i32 addrspace(1)*, align 4
      store i32 addrspace(1)* %a, i32 addrspace(1)** %1, align 4
      store i32 addrspace(1)* %b, i32 addrspace(1)** %2, align 4
      %3 = load i32 addrspace(1)** %1, align 4
      %4 = load i32 addrspace(1)* %3, align 4
      %5 = load i32 addrspace(1)** %2, align 4
      store i32 %4, i32 addrspace(1)* %5, align 4
      ret void
    }

    !opencl.kernels = !{!0}
    !opencl.enable.FP_CONTRACT = !{}
    !opencl.spir.version = !{!6}
    !opencl.ocl.version = !{!7}
    !opencl.used.extensions = !{!8}
    !opencl.used.optional.core.features = !{!8}
    !opencl.compiler.options = !{!8}

    !0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5}
    !1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1}
    !2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"}
    !3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"}
    !4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""}
    !5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"}
    !6 = metadata !{i32 1, i32 0}
    !7 = metadata !{i32 0, i32 0}
    !8 = metadata !{}
  • 46. How to enable your language on GPUs
    ● IOC Output (repeats the IOC-generated module from the previous slide for comparison)
  • 47. How to enable your language on GPUs
    ● So the metadata is different!
    ● We could fix Clang to produce the right metadata…?
    ● Or just hack around!
    ● Let’s use Intel’s compiler to generate a stub function
    ● Then we can use an extern function defined in our Clang module!
  • 48. How to enable your language on GPUs

    // OpenCL C stub, built with Intel's compiler: declares the function and calls it
    extern int doSomething(int a);

    void kernel foo(global int * in, global int * out)
    {
      int id = get_global_id(0);
      out[id] = doSomething(in[id]);
    }

    // definition supplied by our Clang-compiled module
    int doSomething(int a)
    {
      return a;
    }
  • 49. How to enable your language on GPUs
    ● And it fails…?
    ● Intel’s compiler doesn’t like extern functions!
    ● We’ve already bodged it thus far…
    ● So let’s continue!

    // weak stub so the OpenCL C side compiles; the real definition overrides it at link time
    int __attribute__((weak)) doSomething(int a) {}

    void kernel foo(global int * in, global int * out)
    {
      int id = get_global_id(0);
      out[id] = doSomething(in[id]);
    }
  • 50. How to enable your language on GPUs
    ● More than a little nasty…
    ● Relies on Clang extension to declare function weak within OpenCL
    ● Relies on Intel using Clang and allowing the extension
    ● But it works!
    ● Can build both the Intel stub code & the Clang actual code
    ● Then use llvm-link to pull them together!
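    Pulling the two modules together might look roughly like this; a sketch assuming the file names stub.cl (the Intel-compiled OpenCL C with the weak stub) and impl.c (the Clang-compiled definition), with the ioc32 flags taken from slide 41:

    ioc32 -cmd=build -input stub.cl -spir32=stub.bc
    clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc impl.c -o impl.bc
    llvm-link stub.bc impl.bc -o kernel.bc

    The linked kernel.bc can then be handed to clCreateProgramWithBinary exactly as in the harness on slides 38–40.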
  • 51. How to enable your language on GPUs
    ● So now we can compile two OpenCL kernels, link them together, and run it
    ● What is next? Want to enable your language!
    ● What about using Clang, but using a different language?
    ● C & C++ come to mind!
  • 52. How to enable your language on GPUs
    ● Use a simple C file

    int doSomething(int a)
    {
      return a;
    }

    ● And use Clang to compile it

    clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc foo.c -o foo.bc
  • 53. How to enable your language on GPUs
    ● Or a simple C++ file!

    extern "C" int doSomething(int a);

    template<typename T>
    T templatedSomething(const T t)
    {
      return t;
    }

    int doSomething(int a)
    {
      return templatedSomething(a);
    }
  • 54. How to enable your language on GPUs
    ● Let’s have some real C++ code
    ● Use features that OpenCL doesn’t provide us
    ● We’ll do a matrix multiplication in C++
    ● Use classes, constructors, templates
  • 55. How to enable your language on GPUs

    typedef float __attribute__((ext_vector_type(4))) float4;
    typedef float __attribute__((ext_vector_type(16))) float16;

    float __attribute__((overloadable)) dot(float4 a, float4 b);

    template<typename T, unsigned int WIDTH, unsigned int HEIGHT>
    class Matrix
    {
      typedef T __attribute__((ext_vector_type(WIDTH))) RowType;
      RowType rows[HEIGHT];
    public:
      Matrix() {}

      template<typename U>
      Matrix(const U & u)
      {
        __builtin_memcpy(&rows, &u, sizeof(U));
      }

      RowType & operator[](const unsigned int index)
      {
        return rows[index];
      }

      const RowType & operator[](const unsigned int index) const
      {
        return rows[index];
      }
    };
  • 56. How to enable your language on GPUs

    template<typename T, unsigned int WIDTH, unsigned int HEIGHT>
    Matrix<T, WIDTH, HEIGHT> operator *(const Matrix<T, WIDTH, HEIGHT> & a,
                                        const Matrix<T, WIDTH, HEIGHT> & b)
    {
      Matrix<T, HEIGHT, WIDTH> bShuffled;

      for(unsigned int h = 0; h < HEIGHT; h++)
        for(unsigned int w = 0; w < WIDTH; w++)
          bShuffled[w][h] = b[h][w];

      Matrix<T, WIDTH, HEIGHT> result;

      for(unsigned int h = 0; h < HEIGHT; h++)
        for(unsigned int w = 0; w < WIDTH; w++)
          result[h][w] = dot(a[h], bShuffled[w]);

      return result;
    }
  • 57. How to enable your language on GPUs

    extern "C" float16 doSomething(float16 a, float16 b);

    float16 doSomething(float16 a, float16 b)
    {
      Matrix<float, 4, 4> matA(a);
      Matrix<float, 4, 4> matB(b);

      Matrix<float, 4, 4> mul = matA * matB;

      float16 result = (float16 )0;
      result.s0123 = mul[0];
      result.s4567 = mul[1];
      result.s89ab = mul[2];
      result.scdef = mul[3];
      return result;
    }
  • 58. How to enable your language on GPUs
    ● And when we run it…

    ex5.vcxproj -> E:AMDDeveloperSummit2013buildExample5Debugex5.exe
    Found 2 platforms! Choosing vendor 'Intel(R) Corporation'!
    Found 1 devices! SPIR file length '3948' bytes!
    [  0.0,  1.0,  2.0,  3.0] * [ 16.0, 15.0, 14.0, 13.0] = [ 40.0,  34.0,  28.0,  22.0]
    [  4.0,  5.0,  6.0,  7.0] * [ 12.0, 11.0, 10.0,  9.0] = [200.0, 178.0, 156.0, 134.0]
    [  8.0,  9.0, 10.0, 11.0] * [  8.0,  7.0,  6.0,  5.0] = [360.0, 322.0, 284.0, 246.0]
    [ 12.0, 13.0, 14.0, 15.0] * [  4.0,  3.0,  2.0,  1.0] = [520.0, 466.0, 412.0, 358.0]

    ● Success!
  • 59. How to enable your language on GPUs
    ● The least you need to target a GPU;
    ● Generate correct LLVM IR with SPIR metadata
    ● Or at least generate LLVM IR and use the approach we used to combine Clang and IOC generated kernels

    !opencl.kernels = !{!0}
    !opencl.enable.FP_CONTRACT = !{}
    !opencl.spir.version = !{!6}
    !opencl.ocl.version = !{!7}
    !opencl.used.extensions = !{!8}
    !opencl.used.optional.core.features = !{!8}
    !opencl.compiler.options = !{!8}

    !0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5}
    !1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1}
    !2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"}
    !3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"}
    !4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""}
    !5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"}
    !6 = metadata !{i32 1, i32 0}
    !7 = metadata !{i32 0, i32 0}
    !8 = metadata !{}
  • 60. How to enable your language on GPUs
    ● Porting C/C++ libraries to SPIR requires a little more work

    int foo(int * a)
    {
      return *a;
    }

    ● The data pointed to by ‘a’ will by default be put in the private address space
    ● But a straight conversion to SPIR needs all data in global address space
    ● Means that any porting of existing code could be quite intrusive
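    As an illustration of how intrusive that can be, the pointer would need an explicit address space to match the SPIR kernel arguments shown earlier, which live in addrspace(1), the global space. A sketch using Clang’s address_space attribute; this is not code from the talk:

    // hypothetical port: annotate the pointee type so the pointer refers to global memory
    int foo(__attribute__((address_space(1))) int * a)
    {
      return *a;
    }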
  • 61. How to enable your language on GPUs
    ● To target your language at GPUs
      ● Need to be able to segregate work into parallel chunks
      ● Need to deal with distinct address spaces
      ● Have to ban certain features that don’t work with compute
      ● Language could also provide an API onto OpenCL SPIR builtins
    ● But with OpenCL SPIR it is now possible to make any language work on a GPU!
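    One low-tech way to give another language a view of those builtins, following the same weak-stub trick from slides 48–50, is to evaluate the builtins in the OpenCL C kernel and pass their values through as plain arguments. A sketch under those assumptions; the function names are illustrative and whether it links cleanly depends on the toolchain:

    // OpenCL C side, built with ioc32: builtins are evaluated here
    int __attribute__((weak)) doSomething(int value, int globalId) {}

    void kernel foo(global int * in, global int * out)
    {
      int id = get_global_id(0);
      out[id] = doSomething(in[id], id);
    }

    // the other language's module only ever sees ordinary integer arguments
    int doSomething(int value, int globalId)
    {
      return value + globalId;
    }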
  • 62. Developing tools for GPUs
  • 63. Developing tools for GPUs
    ● Tools increasingly required to support development
    ● Even having printf (which OpenCL 1.2 added) is novel!
    ● But with increasingly complex code better tools needed
    ● Main three are debuggers, profilers and compiler-tools
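    For reference, the OpenCL 1.2 printf mentioned above is usable straight from kernel code; a minimal sketch, not taken from the slides:

    void kernel foo(global int * in)
    {
      // poor man's debugging: each work-item reports what it read
      printf("work-item %d read %d\n", (int)get_global_id(0), in[get_global_id(0)]);
    }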
  • 64. Developing tools for GPUs
    ● Debuggers for compute are difficult for non-vendor to develop
    ● Codeplay has developed such tools on top of compute standards
    ● Problem is bedrock for these tools can change at any time
    ● Hard to beat vendor-owned approach that has lower-level access
  • 65. Developing tools for GPUs
    (HSAIL route – “Our Language”)
    ● Codeplay are pushing hard for HSA to have features that aid tool development
    ● Debuggers are much easier with instruction support, debug info, change registers, call stacks
    (SPIR route – “Our Language”)
    ● OpenCL SPIR harder to create debugger for without vendor support
    ● Can we standardize a way to debug OpenCL SPIR, or allow debugging via emulation of SPIR?
  • 66. Developing tools for GPUs
    ● Profilers require superset of debugger feature-set
    ● Need to be able to trap kernels at defined points
    ● Accurate timings only other requirement beyond debugger support
    ● More fun when we go beyond performance, and measure power
  • 67. Developing tools for GPUs
    ● HSA and OpenCL SPIR both good profiler targets
    ● Could split SPIR kernels into profiling sections
    ● Then use existing timing information in OpenCL
    ● HSA will only require debugger features we are pushing for
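    The "existing timing information" above is presumably OpenCL's event profiling API; a minimal sketch of timing one kernel (or one profiling section) with it:

    // queue must be created with profiling enabled
    cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, NULL);

    cl_event run_event;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL, 0, NULL, &run_event);
    clWaitForEvents(1, &run_event);

    cl_ulong start, end;
    clGetEventProfilingInfo(run_event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(run_event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    // (end - start) is that section's device execution time, in nanoseconds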
  • 68. Developing tools for GPUs
    ● Compiler tools consist of optimizers and analysis
    ● Both HSA and OpenCL SPIR being based on LLVM enable this!
    ● We as compiler experts can aid existing runtimes
    ● You as developers can add optimizations & analyse your kernels!
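    Because SPIR modules are LLVM bitcode, the stock LLVM tools can in principle be pointed at them before they are handed to the runtime; a sketch, assuming the optimized module still carries the SPIR metadata the driver expects:

    opt -O3 kernel.bc -o kernel.opt.bc       # run standard LLVM optimization passes
    llvm-dis kernel.opt.bc -o kernel.opt.ll  # inspect what the passes did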
  • 70. Conclusion
    ● With the rise of open standards, compute is increasingly easy
    ● With HSA & OpenCL SPIR hardware is finally open to us!
    ● Just need standards to ratify, mature & be available on hardware!
    ● Next big push into compute is upon us
  • 71. Questions? Can also catch me on twitter @sheredom Neil Henning neil@codeplay.com
  • 72. Resources
    ● SPIR extension on Khronos website
      http://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/cl_khr_spir.html
    ● SPIR provisional specification
      http://www.khronos.org/files/opencl-spir-12-provisional.pdf
    ● HSA Foundation
      http://hsafoundation.com/