HSA MEMORY MODEL
BENEDICT R. GASTER
WWW.QUALCOMM.COM
OUTLINE
- HSA Memory Model
- OpenCL 2.0
  - Has a memory model too
- Obstruction-free bounded deques
  - An example using the HSA memory model

HSA MEMORY MODEL

© Copyright 2012 HSA Foundation. All Rights Reserved.

TYPES OF MODELS
- Shared-memory computers and programming languages divide complexity into models:
  1. Memory model specifies safety
     - e.g. which values can a load observe?
     - This is what this section of the tutorial will focus on
  2. Execution model specifies liveness
     - Described in Ben Sander's tutorial section on HSAIL
     - e.g. can a work-item prevent others from progressing?
  3. Performance model specifies the big picture
     - e.g. caches or branch divergence
     - Specific to particular implementations and outside the scope of today's tutorial

THE PROBLEM
- Assume all locations (a, b, …) are initialized to 0
- What are the values of $s2 and $s4 after execution?

Work-item 0:
    mov_u32 $s1, 1 ;
    st_global_u32 $s1, [&a] ;    // *a = 1;
    ld_global_u32 $s2, [&b] ;    // int x = *b;

Work-item 1:
    mov_u32 $s3, 1 ;
    st_global_u32 $s3, [&b] ;    // *b = 1;
    ld_global_u32 $s4, [&a] ;    // int y = *a;

initially *a = 0 && *b = 0

THE SOLUTION
- The memory model tells us:
  - Defines the visibility of writes to memory at any given point
  - Provides us with a set of possible executions

WHAT MAKES A GOOD MEMORY MODEL*
- Programmability: A good model should make it (relatively) easy to write multi-work-item programs. The model should be intuitive to most users, even to those who have not read the details
- Performance: A good model should facilitate high-performance implementations at reasonable power, cost, etc. It should give implementers broad latitude in options
- Portability: A good model would be adopted widely or at least provide backward compatibility or the ability to translate among models

* S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department, University of Wisconsin–Madison, Nov. 1993.

SEQUENTIAL CONSISTENCY (SC)*
- Axiomatic definition
  - A single processor (core) is sequential if "the result of an execution is the same as if the operations had been executed in the order specified by the program."
  - A multiprocessor is sequentially consistent if "the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program."
- But hardware and compilers actually implement more relaxed models, e.g. ARMv7

* L. Lamport. How to Make a Multiprocessor Computer that Correctly Executes Multiprocessor Programs. IEEE Transactions on Computers, C-28(9):690–691, Sept. 1979.

SEQUENTIAL CONSISTENCY (SC)
Work-item 0:
    mov_u32 $s1, 1 ;
    st_global_u32 $s1, [&a] ;
    ld_global_u32 $s2, [&b] ;

Work-item 1:
    mov_u32 $s3, 1 ;
    st_global_u32 $s3, [&b] ;
    ld_global_u32 $s4, [&a] ;

One possible sequentially consistent interleaving:
    mov_u32 $s1, 1 ;
    mov_u32 $s3, 1 ;
    st_global_u32 $s1, [&a] ;
    ld_global_u32 $s2, [&b] ;
    st_global_u32 $s3, [&b] ;
    ld_global_u32 $s4, [&a] ;

Result: $s2 = 0 && $s4 = 1
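
The interleaving above is only one of the orders SC allows. As a rough sketch (plain C standing in for the two work-items; the names `run_interleaving`, `enumerate_sc`, and `seen` are made up for this illustration), every merge of the two program orders can be enumerated to see which ($s2, $s4) pairs are reachable:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative encoding: each work-item performs a store (step 0)
   followed by a load (step 1). */
enum { N_STEPS = 2 };

/* seen[x][y] is set when some interleaving yields $s2 == x and $s4 == y. */
static bool seen[2][2];

/* Run one interleaving. order[i] names the work-item (0 or 1) that
   executes the i-th operation; each work-item keeps its program order. */
static void run_interleaving(const int order[4]) {
    int a = 0, b = 0;        /* shared locations, initially 0 */
    int s2 = 0, s4 = 0;      /* load results */
    int pc[2] = {0, 0};      /* next step of each work-item */
    for (int i = 0; i < 4; i++) {
        int wi = order[i];
        if (wi == 0) {
            if (pc[0] == 0) a = 1; else s2 = b;
        } else {
            if (pc[1] == 0) b = 1; else s4 = a;
        }
        pc[wi]++;
    }
    seen[s2][s4] = true;
}

/* Enumerate every merge of the two program orders (6 in total). */
static void enumerate_sc(void) {
    for (int mask = 0; mask < 16; mask++) {
        int order[4], count1 = 0;
        for (int i = 0; i < 4; i++) {
            order[i] = (mask >> i) & 1;
            count1 += order[i];
        }
        if (count1 == N_STEPS)   /* each work-item runs exactly 2 steps */
            run_interleaving(order);
    }
}
```

After `enumerate_sc()`, three outcomes are marked in `seen`; the pair $s2 = 0 && $s4 = 0 is not among them, because under SC at least one store precedes both loads.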

BUT WHAT ABOUT ACTUAL HARDWARE
- Sequential consistency is (reasonably) easy to understand, but limits optimizations that the compiler and hardware can perform
- Many modern processors implement many reordering optimizations
  - Store buffers (TSO*): work-items can see their own stores early
  - Reorder buffers (XC*): work-items can see other work-items' stores early

*TSO: Total Store Order, as implemented by SPARC and x86
*XC: Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno

RELAXED CONSISTENCY (XC)
Work-item 0:
    mov_u32 $s1, 1 ;
    st_global_u32 $s1, [&a] ;
    ld_global_u32 $s2, [&b] ;

Work-item 1:
    mov_u32 $s3, 1 ;
    st_global_u32 $s3, [&b] ;
    ld_global_u32 $s4, [&a] ;

One possible relaxed execution order (loads performed before the stores):
    mov_u32 $s1, 1 ;
    mov_u32 $s3, 1 ;
    ld_global_u32 $s2, [&b] ;
    ld_global_u32 $s4, [&a] ;
    st_global_u32 $s1, [&a] ;
    st_global_u32 $s3, [&b] ;

Result: $s2 = 0 && $s4 = 0
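
One way to picture how the loads end up ahead of the stores is a store buffer. The following is a toy model only, assumed for illustration and not a description of any particular processor; `run_with_store_buffers` and `struct outcome` are invented names:

```c
#include <assert.h>
#include <stdbool.h>

/* Result registers of the two loads. */
struct outcome { int s2, s4; };

/* Toy model: each work-item has a one-entry store buffer. A store is
   buffered and reaches memory only when drained; a load reads memory
   directly because it targets a different address than the buffered store.
   drain_early selects whether the stores are drained before the loads run. */
static struct outcome run_with_store_buffers(bool drain_early) {
    int a = 0, b = 0;               /* shared memory, initially 0 */
    int buf0 = 1, buf1 = 1;         /* buffered stores: a = 1 and b = 1 */
    struct outcome r;

    if (drain_early) {              /* behaves like an SC interleaving */
        a = buf0;
        b = buf1;
    }
    r.s2 = b;                       /* work-item 0: ld $s2, [&b] */
    r.s4 = a;                       /* work-item 1: ld $s4, [&a] */
    if (!drain_early) {             /* stores become visible too late */
        a = buf0;
        b = buf1;
    }
    (void)a; (void)b;               /* final memory state is not inspected */
    return r;
}
```

With `drain_early` false the model returns the $s2 = 0 && $s4 = 0 outcome shown on the slide, which no SC interleaving can produce.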

WHAT ARE OUR 3 Ps?
- Programmability: XC makes it really hard for the programmer to reason about what will be visible when
  - Many memory model experts have been known to get it wrong!
- Performance: XC is good for performance; the hardware (and compiler) is free to reorder many loads and stores, opening the door for performance and power enhancements
- Portability: XC is very portable, as it places very few constraints on implementations

MY CHILDREN AND COMPUTER ARCHITECTS ALL WANT
- To have their cake and eat it!

HSA provides: the ability for programmers to reason with the (relatively) intuitive model of SC, while still achieving the benefits of XC!

SEQUENTIAL CONSISTENCY FOR DRF*
- HSA, following Java, C++11, and OpenCL 2.0, adopts SC for Data Race Free (DRF)
  - plus some new capabilities!
- (Informally) A data race occurs when two (or more) work-items access the same memory location such that:
  - At least one of the accesses is a WRITE
  - There are no intervening synchronization operations
- SC for DRF asks:
  - Programmers to ensure programs are DRF under SC
  - Implementers to ensure that all executions of DRF programs on the relaxed model are also SC executions

*S. V. Adve and M. D. Hill. Weak Ordering - A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990.
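
The informal definition can be turned into a small check over a recorded trace. This is a deliberately crude sketch: `struct access` and both function names are hypothetical, and any synchronizing access between two conflicting accesses is treated as "intervening", which is coarser than a real happens-before analysis.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one memory access in an execution trace. */
struct access {
    int       work_item;   /* issuing work-item */
    uintptr_t address;     /* location accessed */
    bool      is_write;    /* store (true) or load (false) */
    bool      is_sync;     /* synchronizing access, e.g. ld_acq / st_rel */
};

/* Two accesses conflict if they come from different work-items, touch the
   same location, and at least one of them is a write. */
static bool conflicts(struct access x, struct access y) {
    return x.work_item != y.work_item &&
           x.address == y.address &&
           (x.is_write || y.is_write);
}

/* Report a data race when two conflicting ordinary accesses appear in the
   trace with no synchronization operation between them. */
static bool trace_has_data_race(const struct access *trace, int n) {
    for (int i = 0; i < n; i++) {
        if (trace[i].is_sync) continue;
        for (int j = i + 1; j < n; j++) {
            if (trace[j].is_sync) break;   /* intervening synchronization */
            if (conflicts(trace[i], trace[j])) return true;
        }
    }
    return false;
}
```

A store and a load to the same address from two work-items with nothing in between is reported as a race; placing a release store and an acquire load on a flag between them is not.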

HSA SUPPORTS RELEASE CONSISTENCY
- HSA's memory model is based on RCsc:
  - All ld_acq and st_rel are SC
    - Means coherence on all ld_acq and st_rel to a single address.
    - All ld_acq and st_rel are program ordered per work-item (actually: sequence order by language constraints)
  - Similar model adopted by ARMv8
- HSA extends RCsc to SC for HRF*, to access the full capabilities of modern heterogeneous systems, containing CPUs, GPUs, and DSPs, for example.

*D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms. MSPC'13.

MAKING RELAXED CONSISTENCY WORK
Work-item 0:
    mov_u32 $s1, 1 ;
    st_global_u32_rel $s1, [&a] ;
    ld_global_u32_acq $s2, [&b] ;

Work-item 1:
    mov_u32 $s3, 1 ;
    st_global_u32_rel $s3, [&b] ;
    ld_global_u32_acq $s4, [&a] ;

One possible execution (st_rel and ld_acq are SC, so the loads cannot move ahead of the stores):
    mov_u32 $s1, 1 ;
    mov_u32 $s3, 1 ;
    st_global_u32_rel $s1, [&a] ;
    ld_global_u32_acq $s2, [&b] ;
    st_global_u32_rel $s3, [&b] ;
    ld_global_u32_acq $s4, [&a] ;

Result: $s2 = 0 && $s4 = 1
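
For readers more used to C11 than HSAIL, a rough counterpart is sketched below. It uses memory_order_seq_cst rather than release/acquire, because in C11 a release store followed by an acquire load of a different location may still be reordered, whereas the slides treat HSA's st_rel and ld_acq as SC. The function names are illustrative; in a real program each function would run on its own thread.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared locations, initially 0 (static storage is zero-initialized). */
static atomic_uint a, b;

/* Work-item 0: store 1 to a, then load b. */
static unsigned work_item0(void) {
    atomic_store_explicit(&a, 1u, memory_order_seq_cst);
    return atomic_load_explicit(&b, memory_order_seq_cst);   /* plays $s2 */
}

/* Work-item 1: store 1 to b, then load a. */
static unsigned work_item1(void) {
    atomic_store_explicit(&b, 1u, memory_order_seq_cst);
    return atomic_load_explicit(&a, memory_order_seq_cst);   /* plays $s4 */
}
```

Calling `work_item0()` and then `work_item1()` from a single thread reproduces the interleaving on the slide ($s2 = 0, $s4 = 1); run concurrently, the seq_cst orderings rule out both loads returning 0.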

SEQUENTIAL CONSISTENCY FOR DRF
- Two memory accesses participate in a data race if they
  - access the same location
  - at least one access is a store
  - can occur simultaneously
    - i.e. appear as adjacent operations in interleaving.
- A program is data-race-free if no possible execution results in a data race.
- Sequential consistency for data-race-free programs
  - Avoid everything else

HSA: Not good enough!

ALL ARE NOT EQUAL – OR SOME CAN SEE BETTER THAN OTHERS
- Remember the HSAIL Execution Model

[Figure: nested scopes, widest to narrowest: platform scope, device scope, group scope, wave scope]

DATA-RACE-FREE IS NOT ENOUGH
- Two ordinary memory accesses participate in a data race if they
  - Access same location
  - At least one is a store
  - Can occur simultaneously

Not a data race… Is it SC? Well, that depends: is visibility implied by causality?

[Figure: t1 and t2 in group #1-2 (S12), t3 and t4 in group #3-4 (S34), both under SGlobal]

t1:
    st_global 1, [&X]
    st_global_rel 0, [&flag]

t2:
    atomic_cas_global_ar 1, 0, [&flag]
    ...
    st_global_rel 0, [&flag]

t4:
    atomic_cas_global_ar 1, 0, [&flag]
    ld (??), [&x]

SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE
- Two memory accesses participate in a heterogeneous race if they
  - access the same location
  - at least one access is a store
  - can occur simultaneously
    - i.e. appear as adjacent operations in interleaving.
  - are not synchronized with "enough" scope
- A program is heterogeneous-race-free if no possible execution results in a heterogeneous race.
- Sequential consistency for heterogeneous-race-free programs
  - Avoid everything else

HSA HETEROGENEOUS RACE FREE
- HRF0: Basic Scope Synchronization
  - "enough" = both threads synchronize using identical scope
- Recall example:
  - Contains a heterogeneous race in HSA

HSA Conclusion: This is bad. Don't do it.

[Figure: t1 and t2 in workgroup #1-2, t3 and t4 in workgroup #3-4]

t1 (workgroup #1-2):
    st_global 1, [&X]
    st_global_rel_wg 0, [&flag]
    ...

t4 (workgroup #3-4):
    atomic_cas_global_ar_wg 1, 0, [&flag]
    ld (??), [&x]

Different scope: t1 and t4 both synchronize at work-group scope, but they are in different work-groups.
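
The HRF0 rule can be sketched as a predicate. Everything here (`struct location`, `struct sync_op`, the scope enum and its ordering) is an assumed representation for illustration, not an HSA API:

```c
#include <assert.h>
#include <stdbool.h>

/* Scopes from the HSAIL execution model, narrowest to widest. */
enum scope { SCOPE_WAVE, SCOPE_GROUP, SCOPE_DEVICE, SCOPE_PLATFORM };

/* Hypothetical description of where a work-item runs. */
struct location { int device, group, wave; };

/* A synchronization operation: who issued it and at which scope. */
struct sync_op { struct location who; enum scope scope; };

/* True if both work-items belong to the same instance of scope s. */
static bool same_scope_instance(struct location x, struct location y,
                                enum scope s) {
    switch (s) {
    case SCOPE_WAVE:
        return x.device == y.device && x.group == y.group && x.wave == y.wave;
    case SCOPE_GROUP:
        return x.device == y.device && x.group == y.group;
    case SCOPE_DEVICE:
        return x.device == y.device;
    case SCOPE_PLATFORM:
        return true;
    }
    return false;
}

/* HRF0 reading of "enough": release and acquire name the identical scope,
   and both work-items sit inside the same instance of that scope. */
static bool synchronized_with_enough_scope(struct sync_op rel,
                                           struct sync_op acq) {
    return rel.scope == acq.scope &&
           same_scope_instance(rel.who, acq.who, rel.scope);
}
```

Applied to the example, a `_wg` release by t1 and a `_wg` acquire by t4 fail the check because the two work-items are in different work-groups; the same pair at platform scope passes.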
HOW TO USE HSA WITH SCOPES
- Want: For performance, use smallest scope possible
- What is safe in HSA?

HSA Scope Selection Guideline:
Use smallest scope that includes all producers/consumers of shared data

Is this a valid assumption?
Implication: Producers/consumers must be known at synchronization time
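
The guideline amounts to a small computation once the participants are known. A minimal sketch, assuming each producer/consumer can be described by hypothetical device, group, and wave identifiers:

```c
#include <assert.h>

/* Scopes, narrowest to widest (ordering assumed for this sketch). */
enum scope { SCOPE_WAVE, SCOPE_GROUP, SCOPE_DEVICE, SCOPE_PLATFORM };

/* Hypothetical description of where a producer or consumer runs. */
struct location { int device, group, wave; };

/* Return the smallest scope that contains every producer/consumer in
   parts[0..n-1]. Assumes n >= 1. Comparing against parts[0] is enough
   because "same device/group/wave" is transitive. */
static enum scope smallest_common_scope(const struct location *parts, int n) {
    enum scope s = SCOPE_WAVE;
    for (int i = 1; i < n; i++) {
        enum scope need;
        if (parts[i].device != parts[0].device)     need = SCOPE_PLATFORM;
        else if (parts[i].group != parts[0].group)  need = SCOPE_DEVICE;
        else if (parts[i].wave != parts[0].wave)    need = SCOPE_GROUP;
        else                                        need = SCOPE_WAVE;
        if (need > s) s = need;
    }
    return s;
}
```

The function needs the full participant list up front, which is exactly the implication stated above: producers and consumers must be known at synchronization time.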
REGULAR GPGPU WORKLOADS
[Figure: Define Problem Space -> Partition Hierarchically -> Communicate Locally (N times) -> Communicate Globally (M times)]

Generally: HSA works well with regular data-parallel workloads

Well defined (regular) data partitioning +
Well defined (regular) synchronization pattern =
- Producer/consumers are always known

IRREGULAR WORKLOADS
- HSA: the example is a race
  - Must upgrade wg (workgroup) -> plat (platform)
- HSA memory model says:
  - ld $s1, [&x] will see value (1)!

[Figure: t1 and t2 in workgroup #1-2, t3 and t4 in workgroup #3-4]

t1:
    st_global 1, [&X]
    st_global_rel_plat 0, [&flag]

t2:
    atomic_cas_global_ar_plat 1, 0, [&flag]
    ...
    st_global_rel_plat 0, [&flag]

t4:
    atomic_cas_global_ar_plat 1, 0, [&flag]
    ld $s1, [&x]

OPENCL 2.0
HAS A MEMORY MODEL TOO
MAPPING ONTO HSA’S MEMORY MODEL
OPENCL 2.0 BACKGROUND
- Provisional specification released at SIGGRAPH'13, July 2013.
- Huge update to OpenCL to account for the evolving hardware landscape and emerging use cases (e.g. irregular workloads)
- Key features:
  - Shared virtual memory, including platform atomics
  - Formally defined memory model based on C11 plus support for scopes
    - Includes an extended set of C1X atomic operations
  - Generic address space, that subsumes global, local, and private
  - Device to device enqueue
  - Out-of-order device side queuing model
  - Backwards compatible with OpenCL 1.x

OPENCL 1.X MEMORY MODEL MAPPING
- It is straightforward to provide a mapping from OpenCL 1.x to the proposed model

OpenCL Operation    HSA Memory Model Operation
Load                ld_global_wg, ld_group_wg
Store               st_global_wg, st_group_wg
atomic_op           atomic_op_global_comp, atomic_op_group_wg
barrier(…)          sync_wg

- OpenCL 1.x atomics are unordered and so map to atomic_op_X
- Mapping for fences not shown but straightforward

OPENCL 2.0 MEMORY MODEL MAPPING
OpenCL Operation                 HSA Memory Model Operation
Load, memory_order_relaxed       atomic_ld_[global | group]_scope
Store, memory_order_relaxed      atomic_st_[global | group]_scope
Load, memory_order_acquire       ld_[global | group]_acq_scope
Load, memory_order_seq_cst       fence_rel_scope ; ld_[global | group]_acq_scope
Store, memory_order_release      st_[global | group]_rel_scope
Store, memory_order_seq_cst      st_[global | group]_rel_scope ; fence_acq_scope
memory_order_acq_rel             atomic_op_[global | group]_acq_rel_scope
memory_order_seq_cst             atomic_op_[global | group]_acq_rel_scope

OPENCL 2.0 MEMORY SCOPE MAPPING
OpenCL Scope                     HSA Scope
memory_scope_work_group          _wg
memory_scope_device              _component
memory_scope_all_svm_devices     _platform
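
As an illustration of how the two tables combine, the sketch below builds the HSA operation text for an atomic load on global memory. The mnemonics are spelled in the slides' own notation (the `_scope` placeholder replaced by the suffix from the scope table); `enum cl_scope` and the function names are hypothetical stand-ins, not OpenCL or HSA APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the OpenCL 2.0 memory_scope values. */
enum cl_scope { WORK_GROUP, DEVICE, ALL_SVM_DEVICES };

/* Scope suffixes as listed in the scope-mapping table. */
static const char *hsa_scope_suffix(enum cl_scope s) {
    switch (s) {
    case WORK_GROUP:      return "_wg";
    case DEVICE:          return "_component";
    case ALL_SVM_DEVICES: return "_platform";
    }
    return "";
}

/* Write the HSA operation(s) for an atomic load on global memory into out,
   following the load rows of the table. Returns 0 on success and -1 for
   memory orders the table does not list for loads. */
static int map_global_load(memory_order mo, enum cl_scope s,
                           char *out, size_t n) {
    const char *sc = hsa_scope_suffix(s);
    switch (mo) {
    case memory_order_relaxed:
        snprintf(out, n, "atomic_ld_global%s", sc);
        return 0;
    case memory_order_acquire:
        snprintf(out, n, "ld_global_acq%s", sc);
        return 0;
    case memory_order_seq_cst:
        snprintf(out, n, "fence_rel%s ; ld_global_acq%s", sc, sc);
        return 0;
    default:
        return -1;
    }
}
```

For example, an acquire load at work-group scope maps to `ld_global_acq_wg`, and a seq_cst load at device scope maps to a release fence followed by an acquire load, both at `_component` scope.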

OBSTRUCTION-FREE
BOUNDED DEQUES
AN EXAMPLE USING THE HSA MEMORY MODEL
CONCURRENT DATA-STRUCTURES
- Why do we need such a memory model in practice?
- One important application of memory consistency is in the development and use of concurrent data-structures
- In particular, there is a class of data-structure implementations that provide non-blocking guarantees:
  - Wait-free: An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes
    - In practice very hard to build efficient data-structures that meet this requirement
  - Lock-free: An algorithm is lock-free if, given enough time, at least one of the work-items (or threads) makes progress
    - In practice lock-free algorithms are implemented by work-items cooperating with one another enough to allow progress
  - Obstruction-free: An algorithm is obstruction-free if a work-item, running in isolation, can make progress
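
To make the lock-free case concrete, here is a minimal C11 sketch of the usual compare-and-swap retry loop (the counter and function name are illustrative, not taken from the talk). A failed strong CAS means some other thread changed the counter in the meantime, so the system as a whole made progress even though this thread must retry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Shared counter, zero-initialized as a static object. */
static _Atomic uint64_t shared_counter;

/* Lock-free add: retry the compare-and-swap until it succeeds.
   Returns the value the counter held before this update. */
static uint64_t lock_free_add(_Atomic uint64_t *ctr, uint64_t delta) {
    uint64_t old = atomic_load_explicit(ctr, memory_order_relaxed);
    while (!atomic_compare_exchange_strong_explicit(
               ctr, &old, old + delta,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* On failure, old now holds the current value; try again. */
    }
    return old;
}
```

No thread ever holds a lock here, so a descheduled or slow work-item cannot block the others, although an individual caller has no bound on its number of retries (the loop is lock-free, not wait-free).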
BUT WHY NOT JUST USE MUTUAL EXCLUSION?
Diversity in a heterogeneous system, such as different clock speeds, different scheduling policies, and more, can mean traditional mutual exclusion is not the right choice

[Figure: Emerging Compute Cluster: four Krait CPUs, 2MB L2, Adreno GPU, Hexagon DSP, two unlabeled (??) blocks, MMUs, Fabric & Memory Controller]

CONCURRENT DATA-STRUCTURES
- Emerging heterogeneous compute clusters mean we need:
  - To adapt existing concurrent data-structures
  - To develop new concurrent data-structures
- Lock-based programming may still be useful, but often these algorithms will need to be lock-free
- Of course, this is a key application of the HSA memory model
- To showcase this we highlight the development of a well-known (HLM) obstruction-free deque*

*Herlihy, M., Luchangco, V., and Moir, M. 2003. Obstruction-free synchronization: double-ended queues as an example. In Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS 2003), 522–529.

HLM - OBSTRUCTION-FREE DEQUE
- Uses a fixed length circular queue
- At any given time, reading from left to right, the array will contain:
  - Zero or more dummy-null (DN) values
  - Zero or more left-null (LN) values
  - Zero or more right-null (RN) values
- At all times there must be:
  - At least two different null values
  - At least one LN or DN, and at least one DN or RN
- Memory consistency is required to allow multiple producers and multiple consumers, potentially happening in parallel from the left and right ends, to see changes from other work-items (HSA Components) and threads (HSA Agents)

HLM - OBSTRUCTION-FREE DEQUE

The left and right hint indices point into the array:

    LN  LN  LN  v  v  RN  RN  RN

Key:
    LN    – left null value
    RN    – right null value
    v     – value
    left  – left hint index
    right – right hint index

C REPRESENTATION OF DEQUE
#include <stdint.h>

struct node {
    uint64_t type    : 2;    // null type (LN, RN, DN)
    uint64_t counter : 8;    // version counter to avoid ABA
    uint64_t value   : 54;   // index value stored in queue
};

struct queue {
    unsigned int size;       // size of bounded buffer
    struct node *array;      // backing store for deque itself
};
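
The order in which a C compiler lays out bit-fields is implementation-defined, while the HSAIL code later in the talk extracts the fields with fixed shifts and masks (type in the top 2 bits, counter in the next 8, value in the low 54). A portable sketch of that layout using explicit shifts follows; the helper names and the numeric type codes are assumptions, since the slides do not give values for LN, RN, and DN.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout implied by the masks on the HSAIL slides:
   type in bits 63..62, counter in bits 61..54, value in bits 53..0. */
enum { TYPE_SHIFT = 62, COUNTER_SHIFT = 54 };
#define COUNTER_MASK UINT64_C(0x3FC0000000000000)
#define VALUE_MASK   UINT64_C(0x003FFFFFFFFFFFFF)

/* Hypothetical type codes; the slides do not give numeric values. */
enum node_type { NODE_VALUE = 0, NODE_LN = 1, NODE_RN = 2, NODE_DN = 3 };

/* Pack the three fields into one 64-bit word; the counter wraps at 8 bits. */
static uint64_t node_pack(enum node_type type, unsigned counter,
                          uint64_t value) {
    return ((uint64_t)type << TYPE_SHIFT) |
           (((uint64_t)counter << COUNTER_SHIFT) & COUNTER_MASK) |
           (value & VALUE_MASK);
}

static enum node_type node_type_of(uint64_t n) {
    return (enum node_type)(n >> TYPE_SHIFT);
}
static unsigned node_counter(uint64_t n) {
    return (unsigned)((n & COUNTER_MASK) >> COUNTER_SHIFT);
}
static uint64_t node_value(uint64_t n) {
    return n & VALUE_MASK;
}
```

Keeping the whole node in a single 64-bit word is what allows it to be read with one atomic load and replaced with one 64-bit compare-and-swap.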

HSAIL REPRESENTATION
- Allocate a deque in global memory using HSAIL

@deque_instance:
    align 64 global_u32 &size;
    align 8 global_u64 &array;

ORACLE
- Assume a function:
    function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);
- Which, given a deque:
  - returns (%k) the position of the leftmost RN
    - ld_global_acq used to read node from array
    - Makes one if necessary (i.e. if there are only LN or DN)
    - atomic_cas_global_ar, required to make new RN
  - returns (%left) the left node (i.e. the value to the left of the leftmost RN position)
  - returns (%right) the right node (i.e. the value at position (%k))

RIGHT POP
function &right_pop(arg_u32 %err, arg_u64 %result) (arg_u64 %deque) {
    // load queue address
    ld_arg_u64 $d0, [%deque];
@loop_forever:
    // set up and call the right oracle to get the next RN
    arg_u32 %k; arg_u64 %current; arg_u64 %next;
    call &rcheck_oracle (%k, %current, %next) (%deque);
    ld_arg_u32 $s0, [%k]; ld_arg_u64 $d1, [%current]; ld_arg_u64 $d2, [%next];
    // current.type($d5)
    shr_u64 $d5, $d1, 62;
    // current.counter($d6)
    and_u64 $d6, $d1, 0x3FC0000000000000;
    shr_u64 $d6, $d6, 54;
    // current.value($d7)
    and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
    // next.counter($d8)
    and_u64 $d8, $d2, 0x3FC0000000000000; shr_u64 $d8, $d8, 54;
    // (the test for empty and the read/remove attempt from the next two slides go here)
    brn @loop_forever ;
}

RIGHT POP – TEST FOR EMPTY
// not empty unless current.type($d5) == LN || current.type($d5) == DN
cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN;
and_b1 $c0, $c0, $c1;
cbr $c0, @not_empty ;
// current node address ($d3) = %deque($d0) + (%k($s0) - 1) * 16
add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 16; cvt_u64_u32 $d3, $s1; add_u64 $d3, $d0, $d3;
ld_global_acq_u64 $d4, [$d3];
cmp_neq_b1_u64 $c0, $d4, $d1;
cbr $c0, @not_empty;
// deque empty so return EMPTY
st_arg_u32 EMPTY, [%err];
ret;
@not_empty:

RIGHT POP – TRY READ/REMOVE NODE
// $d9 = (RN, next.cnt+1, 0)
add_u64 $d8, $d8, 1;
and_u64 $d8, $d8, 0xFF;          // counter is 8 bits wide, so wrap
shl_u64 $d8, $d8, 54;
shl_u64 $d9, RN, 62;
or_b64 $d9, $d9, $d8;
// cas(deq+k, next, node(RN, next.cnt+1, 0)); the next node follows the current node ($d3)
add_u64 $d10, $d3, 16;
atomic_cas_global_ar_u64 $d9, [$d10], $d2, $d9;
cmp_neq_b1_u64 $c0, $d9, $d2;
cbr $c0, @cas_failed;
// $d9 = (RN, current.cnt+1, 0)
add_u64 $d6, $d6, 1;
and_u64 $d6, $d6, 0xFF;
shl_u64 $d6, $d6, 54;
shl_u64 $d9, RN, 62;
or_b64 $d9, $d9, $d6;
// cas(deq+(k-1), curr, node(RN, curr.cnt+1, 0))
atomic_cas_global_ar_u64 $d9, [$d3], $d1, $d9;
cmp_neq_b1_u64 $c0, $d9, $d1;
cbr $c0, @cas_failed;
st_arg_u32 SUCCESS, [%err];
st_arg_u64 $d7, [%result];
ret;
@cas_failed:
// loop back around and try again
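
For comparison, the same right-pop logic can be sketched in C11 atomics. This is a simplified illustration, not the talk's implementation: it uses a linear (non-circular) array without DN values, a linear scan in place of the oracle, and invented names and type codes throughout. It assumes slot 0 always holds LN and the last slot always holds RN.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum { T_VALUE = 0, T_LN = 1, T_RN = 2 };   /* hypothetical type codes */
enum { POP_SUCCESS = 0, POP_EMPTY = 1 };
#define DEQUE_SIZE 6

static _Atomic uint64_t deque[DEQUE_SIZE];

/* Pack (type, counter, value): type in the top 2 bits, 8-bit counter below. */
static uint64_t mk(uint64_t type, uint64_t ctr, uint64_t val) {
    return (type << 62) | ((ctr & 0xFF) << 54) |
           (val & UINT64_C(0x3FFFFFFFFFFFFF));
}
static uint64_t type_of(uint64_t n) { return n >> 62; }
static uint64_t ctr_of(uint64_t n)  { return (n >> 54) & 0xFF; }
static uint64_t val_of(uint64_t n)  { return n & UINT64_C(0x3FFFFFFFFFFFFF); }

/* Stand-in oracle: linear scan for the leftmost RN (index >= 1). */
static int right_oracle(void) {
    int k = 1;
    while (k < DEQUE_SIZE - 1 &&
           type_of(atomic_load_explicit(&deque[k], memory_order_acquire)) != T_RN)
        k++;
    return k;
}

/* Pop from the right end; writes the value to *out on success. */
static int right_pop(uint64_t *out) {
    for (;;) {
        int k = right_oracle();
        uint64_t cur  = atomic_load_explicit(&deque[k - 1], memory_order_acquire);
        uint64_t next = atomic_load_explicit(&deque[k], memory_order_acquire);
        if (type_of(cur) == T_RN || type_of(next) != T_RN)
            continue;                       /* stale view: ask the oracle again */
        if (type_of(cur) == T_LN &&
            atomic_load_explicit(&deque[k - 1], memory_order_acquire) == cur)
            return POP_EMPTY;
        /* Bump next's counter first, then turn cur into an RN. */
        uint64_t exp = next;
        if (!atomic_compare_exchange_strong_explicit(
                &deque[k], &exp, mk(T_RN, ctr_of(next) + 1, 0),
                memory_order_acq_rel, memory_order_acquire))
            continue;
        exp = cur;
        if (atomic_compare_exchange_strong_explicit(
                &deque[k - 1], &exp, mk(T_RN, ctr_of(cur) + 1, 0),
                memory_order_acq_rel, memory_order_acquire)) {
            *out = val_of(cur);
            return POP_SUCCESS;
        }
    }
}

/* Fill the array as LN LN 7 9 RN RN for the usage example. */
static void deque_fill_example(void) {
    uint64_t init[DEQUE_SIZE] = { mk(T_LN, 0, 0), mk(T_LN, 0, 0),
                                  mk(T_VALUE, 0, 7), mk(T_VALUE, 0, 9),
                                  mk(T_RN, 0, 0), mk(T_RN, 0, 0) };
    for (int i = 0; i < DEQUE_SIZE; i++)
        atomic_store(&deque[i], init[i]);
}
```

After `deque_fill_example()`, successive calls to `right_pop` yield 9, then 7, then report empty. The acquire loads and acq_rel compare-and-swaps play the roles of `ld_global_acq` and `atomic_cas_global_ar` in the HSAIL version.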
TAKE AWAYS
- HSA provides a powerful and modern memory model
  - Defined as Release Consistency
  - Based on the well-known SC for DRF
  - Extended with scopes as defined by HRF
- OpenCL 2.0 introduces a new memory model
  - Also defined in terms of Release Consistency
  - Also based on SC for DRF
  - Also extended with scope as defined in HRF
  - Has a well defined mapping to HSA
- Concurrent algorithm development for emerging heterogeneous computing clusters can benefit from HSA and OpenCL 2.0 memory models

 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

HSA-4123, HSA Memory Model, by Ben Gaster

  • 1. HSA MEMORY MODEL BENEDICT R. GASTER WWW.QUALCOMM.COM
  • 2. OUTLINE: HSA Memory Model. OpenCL 2.0: has a memory model too. Obstruction-free bounded deques: an example using the HSA memory model.
  • 3. HSA MEMORY MODEL © Copyright 2012 HSA Foundation. All Rights Reserved. 3
  • 4. TYPES OF MODELS: Shared-memory computers and programming languages divide complexity into models. (1) Memory model specifies safety, e.g. which values can a load observe? This is what this section of the tutorial will focus on. (2) Execution model specifies liveness, e.g. can a work-item prevent others from progressing? Described in Ben Sander’s tutorial section on HSAIL. (3) Performance model specifies the big picture, e.g. caches or branch divergence; specific to particular implementations and outside the scope of today’s tutorial.
  • 5. THE PROBLEM: Assume all locations (a, b, …) are initialized to 0 (initially *a = 0 && *b = 0). What are the values of $s2 and $s4 after execution?
      Work-item 0 (in C: *a = 1; int x = *b;)
        mov_u32 $s1, 1;
        st_global_u32 $s1, [&a];
        ld_global_u32 $s2, [&b];
      Work-item 1 (in C: *b = 1; int y = *a;)
        mov_u32 $s3, 1;
        st_global_u32 $s3, [&b];
        ld_global_u32 $s4, [&a];
  • 6. THE SOLUTION: The memory model defines the visibility of writes to memory at any given point and provides us with a set of possible executions.
  • 7. WHAT MAKES A GOOD MEMORY MODEL* Programmability: a good model should make it (relatively) easy to write multi-work-item programs. The model should be intuitive to most users, even to those who have not read the details. Performance: a good model should facilitate high-performance implementations at reasonable power, cost, etc. It should give implementers broad latitude in options. Portability: a good model would be adopted widely or at least provide backward compatibility or the ability to translate among models. * S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department, University of Wisconsin–Madison, Nov. 1993.
  • 8. SEQUENTIAL CONSISTENCY (SC)* Axiomatic definition: a single processor (core) is sequential if “the result of an execution is the same as if the operations had been executed in the order specified by the program.” A multiprocessor is sequentially consistent if “the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.” But hardware and compilers actually implement more relaxed models, e.g. ARMv7. * L. Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, C-28(9):690–691, Sept. 1979.
  • 9. SEQUENTIAL CONSISTENCY (SC)
      Work-item 0
        mov_u32 $s1, 1;
        st_global_u32 $s1, [&a];
        ld_global_u32 $s2, [&b];
      Work-item 1
        mov_u32 $s3, 1;
        st_global_u32 $s3, [&b];
        ld_global_u32 $s4, [&a];
      One sequentially consistent interleaving
        mov_u32 $s1, 1;
        mov_u32 $s3, 1;
        st_global_u32 $s1, [&a];
        ld_global_u32 $s2, [&b];
        st_global_u32 $s3, [&b];
        ld_global_u32 $s4, [&a];
      Result: $s2 = 0 && $s4 = 1
  • 10. BUT WHAT ABOUT ACTUAL HARDWARE: Sequential consistency is (reasonably) easy to understand, but limits the optimizations that the compiler and hardware can perform. Many modern processors implement reordering optimizations: store buffers (TSO*), where work-items can see their own stores early; reorder buffers (XC*), where work-items can see other work-items’ stores early. *TSO: Total Store Order, as implemented by SPARC and x86. *XC: Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno.
  • 11. RELAXED CONSISTENCY (XC)
      Work-item 0
        mov_u32 $s1, 1;
        st_global_u32 $s1, [&a];
        ld_global_u32 $s2, [&b];
      Work-item 1
        mov_u32 $s3, 1;
        st_global_u32 $s3, [&b];
        ld_global_u32 $s4, [&a];
      One relaxed execution (loads performed before the earlier stores)
        mov_u32 $s1, 1;
        mov_u32 $s3, 1;
        ld_global_u32 $s2, [&b];
        ld_global_u32 $s4, [&a];
        st_global_u32 $s1, [&a];
        st_global_u32 $s3, [&b];
      Result: $s2 = 0 && $s4 = 0
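The two outcomes shown for SC and XC can be checked by brute force. The following is a minimal sketch in C, not HSAIL: it models the four memory operations as plain assignments and enumerates every total order of them. The function names (`run_order`, `pos`, `litmus_outcomes`) are hypothetical, and "relaxed" is modelled here only as permission to perform a load before the same work-item's earlier store, which is a simplification of what real relaxed hardware allows.

```c
#include <assert.h>
#include <stdbool.h>

/* Work-item 0: a = 1; x = b;     Work-item 1: b = 1; y = a; */
enum { ST_A, LD_B, ST_B, LD_A };

/* Execute one total order of the four operations; returns x * 2 + y. */
static int run_order(const int order[4])
{
    int a = 0, b = 0, x = 0, y = 0;
    for (int i = 0; i < 4; i++) {
        switch (order[i]) {
        case ST_A: a = 1; break;
        case LD_B: x = b; break;
        case ST_B: b = 1; break;
        case LD_A: y = a; break;
        }
    }
    return x * 2 + y;
}

/* Index of operation op within order. */
static int pos(const int order[4], int op)
{
    for (int i = 0; i < 4; i++)
        if (order[i] == op)
            return i;
    assert(0 && "operation missing from order");
    return -1;
}

/* Fill seen[x][y] with every outcome reachable under the chosen model.
 * keep_program_order = true : SC, each store precedes its work-item's load.
 * keep_program_order = false: loads may be performed before earlier stores. */
void litmus_outcomes(bool keep_program_order, bool seen[2][2])
{
    int order[4];

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            seen[i][j] = false;

    for (int code = 0; code < 256; code++) {      /* 4^4 candidate sequences */
        int used = 0;
        for (int i = 0, c = code; i < 4; i++, c /= 4) {
            order[i] = c % 4;
            used |= 1 << order[i];
        }
        if (used != 0xF)
            continue;                             /* not a permutation */
        if (keep_program_order &&
            (pos(order, ST_A) > pos(order, LD_B) ||
             pos(order, ST_B) > pos(order, LD_A)))
            continue;                             /* violates program order */
        int r = run_order(order);
        seen[r / 2][r % 2] = true;
    }
}
```

Calling `litmus_outcomes(true, seen)` leaves `seen[0][0]` false: no SC interleaving lets both loads return 0. With `false`, that outcome becomes reachable, which is the XC result on the slide.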
  • 12. WHAT ARE OUR 3 Ps? Programmability: XC makes it really hard for the programmer to reason about what will be visible when; many memory model experts have been known to get it wrong! Performance: XC is good for performance; the hardware (compiler) is free to reorder many loads and stores, opening the door for performance and power enhancements. Portability: XC is very portable, as it places very few constraints.
  • 13. MY CHILDREN AND COMPUTER ARCHITECTS ALL WANT to have their cake and eat it! HSA provides the ability for programmers to reason with the (relatively) intuitive model of SC, while still achieving the benefits of XC! [Image placeholder: picture with kids and cake]
  • 14. SEQUENTIAL CONSISTENCY FOR DRF* HSA, following Java, C++11, and OpenCL 2.0, adopts SC for Data Race Free (DRF), plus some new capabilities! (Informally) a data race occurs when two (or more) work-items access the same memory location such that at least one of the accesses is a WRITE and there are no intervening synchronization operations. SC for DRF asks programmers to ensure programs are DRF under SC, and implementers to ensure that all executions of DRF programs on the relaxed model are also SC executions. *S. V. Adve and M. D. Hill. Weak Ordering - A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990.
  • 15. HSA SUPPORTS RELEASE CONSISTENCY: HSA’s memory model is based on RCSC. All ld_acq and st_rel are SC, which means coherence on all ld_acq and st_rel to a single address. All ld_acq and st_rel are program ordered per work-item (actually: sequence-ordered by language constraints). A similar model was adopted by ARMv8. HSA extends RCSC to SC for HRF*, to access the full capabilities of modern heterogeneous systems containing, for example, CPUs, GPUs, and DSPs. *D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms. MSPC’13.
  • 16. MAKING RELAXED CONSISTENCY WORK
      Work-item 0
        mov_u32 $s1, 1;
        st_global_u32_rel $s1, [&a];
        ld_global_u32_acq $s2, [&b];
      Work-item 1
        mov_u32 $s3, 1;
        st_global_u32_rel $s3, [&b];
        ld_global_u32_acq $s4, [&a];
      One permitted interleaving
        mov_u32 $s1, 1;
        mov_u32 $s3, 1;
        st_global_u32_rel $s1, [&a];
        ld_global_u32_acq $s2, [&b];
        st_global_u32_rel $s3, [&b];
        ld_global_u32_acq $s4, [&a];
      Result: $s2 = 0 && $s4 = 1
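For readers more familiar with C11 atomics than HSAIL, the example above can be sketched as follows. This is an illustration under an assumption, not an official mapping: because the deck states that all ld_acq and st_rel are SC, the sketch uses `memory_order_seq_cst`. Plain `memory_order_release` stores and `memory_order_acquire` loads in C11 would still allow both loads to return 0. The function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared locations from the example; static storage starts at 0. */
static atomic_uint a, b;

/* Work-item 0: st_global_u32_rel 1, [&a]; ld_global_u32_acq $s2, [&b]; */
unsigned work_item0(void)
{
    atomic_store_explicit(&a, 1u, memory_order_seq_cst);
    return atomic_load_explicit(&b, memory_order_seq_cst);
}

/* Work-item 1: st_global_u32_rel 1, [&b]; ld_global_u32_acq $s4, [&a]; */
unsigned work_item1(void)
{
    atomic_store_explicit(&b, 1u, memory_order_seq_cst);
    return atomic_load_explicit(&a, memory_order_seq_cst);
}
```

Called one after the other on a single thread, the two functions reproduce the interleaving on the slide ($s2 = 0, $s4 = 1). Run on two threads, the sequentially consistent ordering rules out the outcome in which both return 0.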
  • 17. SEQUENTIAL CONSISTENCY FOR DRF: Two memory accesses participate in a data race if they access the same location, at least one access is a store, and they can occur simultaneously, i.e. appear as adjacent operations in an interleaving. A program is data-race-free if no possible execution results in a data race. Sequential consistency for data-race-free programs; avoid everything else. HSA: Not good enough!
  • 18. ALL ARE NOT EQUAL – OR SOME CAN SEE BETTER THAN OTHERS: Remember the HSAIL Execution Model: platform scope, device scope, group scope, wave scope.
  • 19. DATA-RACE-FREE IS NOT ENOUGH: Two ordinary memory accesses participate in a data race if they access the same location, at least one is a store, and they can occur simultaneously. [Figure: scope SGlobal contains scopes S12 and S34; work-items t1, t2 are in group #1-2 and t3, t4 are in group #3-4] Instruction sequences, in the order shown on the slide:
      st_global 1, [&x]
      st_global_rel 0, [&flag]

      atomic_cas_global_ar 1, 0, [&flag]
      ...
      st_global_rel 0, [&flag]

      atomic_cas_global_ar 1, 0, [&flag]
      ld (??), [&x]
    Not a data race… Is it SC? Well, that depends: is visibility implied by causality?
  • 20. SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE: Two memory accesses participate in a heterogeneous race if they access the same location, at least one access is a store, they can occur simultaneously (i.e. appear as adjacent operations in an interleaving), and they are not synchronized with “enough” scope. A program is heterogeneous-race-free if no possible execution results in a heterogeneous race. Sequential consistency for heterogeneous-race-free programs; avoid everything else.
  • 21. HSA HETEROGENEOUS RACE FREE: HRF0: Basic Scope Synchronization: “enough” = both threads synchronize using identical scope. Recall the example: it contains a heterogeneous race in HSA, because the two work-items are in different workgroups (Workgroup #1-2 and Workgroup #3-4) yet synchronize at workgroup scope, i.e. with different scope instances.
      st_global 1, [&x]
      st_global_rel_wg 0, [&flag]
      ...
      atomic_cas_global_ar_wg 1, 0, [&flag]
      ld (??), [&x]
    HSA conclusion: This is bad. Don’t do it.
  • 22. HOW TO USE HSA WITH SCOPES: Want: for performance, use the smallest scope possible. What is safe in HSA? HSA Scope Selection Guideline: use the smallest scope that includes all producers/consumers of shared data. Implication: producers/consumers must be known at synchronization time. Is this a valid assumption?
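The HRF0 rule and the scope selection guideline above can be sketched as two small predicates. This is an illustration only: the `work_item` coordinates, the enum names, and the function names are hypothetical, and the requirement that a scope instance must contain both work-items is my reading of the workgroup example rather than a quotation from the HSA specification.

```c
#include <assert.h>
#include <stdbool.h>

/* Scopes from the HSAIL execution model, smallest to largest. */
enum scope { SCOPE_WAVE, SCOPE_WG, SCOPE_DEVICE, SCOPE_PLATFORM };

/* Hypothetical position of a work-item in the scope hierarchy. */
struct work_item { int device, workgroup, wave; };

/* True if p and q belong to the same instance of scope s. */
bool scope_covers(enum scope s, struct work_item p, struct work_item q)
{
    switch (s) {
    case SCOPE_PLATFORM:
        return true;
    case SCOPE_DEVICE:
        return p.device == q.device;
    case SCOPE_WG:
        return p.device == q.device && p.workgroup == q.workgroup;
    case SCOPE_WAVE:
        return p.device == q.device && p.workgroup == q.workgroup &&
               p.wave == q.wave;
    }
    return false;
}

/* HRF0-style check: a release by p and an acquire by q synchronize only if
 * both use the identical scope and that scope instance contains both. */
bool hrf0_synchronizes(struct work_item p, enum scope rel,
                       struct work_item q, enum scope acq)
{
    return rel == acq && scope_covers(rel, p, q);
}

/* Scope selection guideline: smallest scope containing all n participants. */
enum scope smallest_scope(const struct work_item *items, int n)
{
    assert(n >= 1);
    for (int s = SCOPE_WAVE; s < SCOPE_PLATFORM; s++) {
        bool all = true;
        for (int i = 1; i < n; i++)
            if (!scope_covers((enum scope)s, items[0], items[i]))
                all = false;
        if (all)
            return (enum scope)s;
    }
    return SCOPE_PLATFORM;
}
```

For two work-items in different workgroups of the same device, `hrf0_synchronizes` rejects workgroup scope and accepts platform scope, and `smallest_scope` returns `SCOPE_DEVICE`. The guideline's implication is visible in the signature: the full list of participants has to be available when the scope is chosen.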
  • 23. REGULAR GPGPU WORKLOADS: Generally, HSA works well with regular data-parallel workloads. Well-defined (regular) data partitioning + well-defined (regular) synchronization pattern = producers/consumers are always known. [Figure: Define Problem Space; Partition Hierarchically; Communicate Locally; Communicate Globally; the communication steps repeat N and M times]
  • 24. IRREGULAR WORKLOADS: HSA: the example is a race; must upgrade wg (workgroup) -> plat (platform). The HSA memory model then says: ld $s1, [&x] will see value (1)! [Figure: work-items t1, t2 in Workgroup #1-2 and t3, t4 in Workgroup #3-4] Instruction sequences, in the order shown on the slide:
      st_global 1, [&x]
      st_global_rel_plat 0, [&flag]

      atomic_cas_global_ar_plat 1, 0, [&flag]
      ...
      st_global_rel_plat 0, [&flag]

      atomic_cas_global_ar_plat 1, 0, [&flag]
      ld $s1, [&x]
  • 25. OPENCL 2.0 HAS A MEMORY MODEL TOO: MAPPING ONTO HSA’S MEMORY MODEL
  • 26. OPENCL 2.0 BACKGROUND: Provisional specification released at SIGGRAPH’13, July 2013. Huge update to OpenCL to account for the evolving hardware landscape and emerging use cases (e.g. irregular workloads). Key features: shared virtual memory, including platform atomics; formally defined memory model based on C11 plus support for scopes; an extended set of C1X atomic operations; generic address space that subsumes global, local, and private; device-to-device enqueue; out-of-order device-side queuing model; backwards compatible with OpenCL 1.x.
  • 27. OPENCL 1.X MEMORY MODEL MAPPING: It is straightforward to provide a mapping from OpenCL 1.x to the proposed model.
      OpenCL Operation    HSA Memory Model Operation
      Load                ld_global_wg, ld_group_wg
      Store               st_global_wg, st_group_wg
      atomic_op           atomic_op_global_comp, atomic_op_group_wg
      barrier(…)          sync_wg
    OpenCL 1.x atomics are unordered and so map to atomic_op_X. The mapping for fences is not shown but is straightforward.
  • 28. OPENCL 2.0 MEMORY MODEL MAPPING
      OpenCL Operation               HSA Memory Model Operation
      Load memory_order_relaxed      atomic_ld_[global | group]_scope
      Store memory_order_relaxed     atomic_st_[global | group]_scope
      Load memory_order_acquire      ld_[global | group]_acq_scope
      Load memory_order_seq_cst      fence_rel_scope ; ld_[global | group]_acq_scope
      Store memory_order_release     st_[global | group]_rel_scope
      Store memory_order_seq_cst     st_[global | group]_rel_scope ; fence_acq_scope
      memory_order_acq_rel           atomic_op_[global | group]_acq_rel_scope
      memory_order_seq_cst           atomic_op_[global | group]_acq_rel_scope
  • 29. OPENCL 2.0 MEMORY SCOPE MAPPING
      OpenCL Scope                     HSA Scope
      memory_scope_work_group          _wg
      memory_scope_device              _component
      memory_scope_all_svm_devices     _platform
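The two mapping tables above compose mechanically: pick the operation pattern from the memory order, then substitute the segment and the scope suffix. A minimal sketch for the three load rows follows, assuming the table entries are used literally as string templates. The helper names `hsa_scope_suffix` and `hsa_load`, and the `enum order`, are hypothetical and not part of any OpenCL or HSA API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum order { ORDER_RELAXED, ORDER_ACQUIRE, ORDER_SEQ_CST };

/* OpenCL 2.0 scope name -> HSA scope suffix, or NULL if not in the table. */
const char *hsa_scope_suffix(const char *opencl_scope)
{
    static const struct { const char *cl, *hsa; } table[] = {
        { "memory_scope_work_group",      "_wg" },
        { "memory_scope_device",          "_component" },
        { "memory_scope_all_svm_devices", "_platform" },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(opencl_scope, table[i].cl) == 0)
            return table[i].hsa;
    return NULL;
}

/* Write the HSA operation(s) for an OpenCL 2.0 atomic load into buf.
 * segment is "global" or "group"; scope is a suffix such as "_wg".
 * Returns the snprintf result, or -1 for an unknown order. */
int hsa_load(char *buf, size_t len, enum order o,
             const char *segment, const char *scope)
{
    switch (o) {
    case ORDER_RELAXED:
        return snprintf(buf, len, "atomic_ld_%s%s", segment, scope);
    case ORDER_ACQUIRE:
        return snprintf(buf, len, "ld_%s_acq%s", segment, scope);
    case ORDER_SEQ_CST:
        return snprintf(buf, len, "fence_rel%s ; ld_%s_acq%s",
                        scope, segment, scope);
    }
    return -1;
}
```

For example, a sequentially consistent load from global memory at `memory_scope_device` expands to a release fence followed by an acquire load, both at component scope.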
  • 30. OBSTRUCTION-FREE BOUNDED DEQUES: AN EXAMPLE USING THE HSA MEMORY MODEL
  • 31. CONCURRENT DATA-STRUCTURES: Why do we need such a memory model in practice? One important application of memory consistency is in the development and use of concurrent data-structures. In particular, there is a class of data-structure implementations that provide non-blocking guarantees. Wait-free: an algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes; in practice it is very hard to build efficient data-structures that meet this requirement. Lock-free: an algorithm is lock-free if, given enough time, at least one of the work-items (or threads) makes progress; in practice lock-free algorithms are implemented by work-items cooperating with one another to allow progress. Obstruction-free: an algorithm is obstruction-free if a work-item, running in isolation, can make progress.
  • 32. BUT WHY NOT JUST USE MUTUAL EXCLUSION? Diversity in a heterogeneous system, such as different clock speeds, different scheduling policies, and more, can mean traditional mutual exclusion is not the right choice. [Figure: Emerging Compute Cluster: Krait CPUs, Adreno GPU, Hexagon DSP, MMUs, 2MB L2, Fabric & Memory Controller]
  • 33. CONCURRENT DATA-STRUCTURES: Emerging heterogeneous compute clusters mean we need to adapt existing concurrent data-structures and develop new concurrent data-structures. Lock-based programming may still be useful, but often these algorithms will need to be lock-free. Of course, this is a key application of the HSA memory model. To showcase this we highlight the development of a well-known (HLM) obstruction-free deque*. *Herlihy, M. et al. 2003. Obstruction-free synchronization: double-ended queues as an example. (2003), 522–529.
  • 34. HLM - OBSTRUCTION-FREE DEQUE: Uses a fixed-length circular queue. At any given time, reading from left to right, the array will contain: zero or more dummy-null (DN) values; zero or more left-null (LN) values; zero or more right-null (RN) values. At all times there must be at least two different null values: at least one LN or DN, and at least one DN or RN. Memory consistency is required to allow multiple producers and multiple consumers, potentially operating in parallel from the left and right ends, to see changes from other work-items (HSA Components) and threads (HSA Agents).
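The "at all times" condition above is easy to state as a predicate over the null types present in the array. The sketch below checks only that condition, not the left-to-right ordering of the cells. The `enum cell` encoding and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cell kinds: a stored value or one of the three nulls. */
enum cell { CELL_VALUE, CELL_LN, CELL_RN, CELL_DN };

/* True if the array holds at least two different null kinds, with at least
 * one LN or DN and at least one DN or RN. */
bool nulls_invariant_holds(const enum cell *cells, size_t n)
{
    bool ln = false, rn = false, dn = false;

    for (size_t i = 0; i < n; i++) {
        if (cells[i] == CELL_LN) ln = true;
        if (cells[i] == CELL_RN) rn = true;
        if (cells[i] == CELL_DN) dn = true;
    }
    int kinds = ln + rn + dn;       /* number of distinct null kinds present */
    return kinds >= 2 && (ln || dn) && (dn || rn);
}
```

An array such as LN, v, RN satisfies the predicate, while an array holding only DN cells, or only LN cells and values, does not.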
  • 35. HLM - OBSTRUCTION-FREE DEQUE [Figure: array contents LN LN LN v v RN RN RN, with the left and right hint indices marked] Key: LN = left null value; RN = right null value; v = value; left = left hint index; right = right hint index.
  • 36. C REPRESENTATION OF DEQUE
      #include <stdint.h>

      struct node {
          uint64_t type    : 2;   // null type (LN, RN, DN)
          uint64_t counter : 8;   // version counter to avoid ABA
          uint64_t value   : 54;  // index value stored in queue
      };

      struct queue {
          unsigned int size;      // size of bounded buffer
          struct node *array;     // backing store for deque itself
      };
  • 37. HSAIL REPRESENTATION: Allocate a deque in global memory using HSAIL:
      @deque_instance:
        align 64 global_u32 &size;
        align 8 global_u64 &array;
  • 38. ORACLE: Assume a function:
      function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);
    which, given a deque: returns (%k) the position of the leftmost RN (ld_global_acq is used to read a node from the array), making one if necessary, i.e. if there are only LN or DN (atomic_cas_global_ar is required to make a new RN); returns (%left) the left node, i.e. the value to the left of the leftmost RN position; returns (%right) the right node, i.e. the value at position (%k).
  • 39. RIGHT POP
      function &right_pop (arg_u32 %err, arg_u64 %result) (arg_u64 %deque) {
        // load queue address
        ld_arg_u64 $d0, [%deque];
      @loop_forever:
        // setup and call right oracle to get next RN
        arg_u32 %k;
        arg_u64 %current;
        arg_u64 %next;
        call &rcheck_oracle (%k, %current, %next) (%deque);
        ld_arg_u32 $s0, [%k];
        ld_arg_u64 $d1, [%current];
        ld_arg_u64 $d2, [%next];
        // current.type($d5)
        shr_u64 $d5, $d1, 62;
        // current.counter($d6)
        and_u64 $d6, $d1, 0x3FC0000000000000;
        shr_u64 $d6, $d6, 54;
        // current.value($d7)
        and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
        // next.counter($d8)
        and_u64 $d8, $d2, 0x3FC0000000000000;
        shr_u64 $d8, $d8, 54;
        // test for empty, then try to read/remove the node (slides 40 and 41)
        brn @loop_forever;
      }
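The shifts and masks in the HSAIL above imply a layout of type in bits 63..62, counter in bits 61..54, and value in bits 53..0. A C sketch of the same field access follows, using explicit shifts rather than bit-fields, because bit-field ordering within a word is implementation-defined in C. The helper names and the numeric encoding of LN, RN, and DN are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of the 2-bit type field. */
enum node_kind { NODE_VALUE = 0, NODE_LN = 1, NODE_RN = 2, NODE_DN = 3 };

#define TYPE_SHIFT    62
#define COUNTER_SHIFT 54
#define COUNTER_MASK  UINT64_C(0x3FC0000000000000)   /* bits 61..54 */
#define VALUE_MASK    UINT64_C(0x003FFFFFFFFFFFFF)   /* bits 53..0  */

/* Build a 64-bit node; the counter wraps at 8 bits. */
uint64_t node_pack(unsigned type, unsigned counter, uint64_t value)
{
    return ((uint64_t)(type & 0x3u) << TYPE_SHIFT)
         | ((uint64_t)(counter & 0xFFu) << COUNTER_SHIFT)
         | (value & VALUE_MASK);
}

/* Same operation as: shr_u64 $d5, $d1, 62 */
unsigned node_type(uint64_t n)
{
    return (unsigned)(n >> TYPE_SHIFT);
}

/* Same operations as: and_u64 ..., 0x3FC0000000000000; shr_u64 ..., 54 */
unsigned node_counter(uint64_t n)
{
    return (unsigned)((n & COUNTER_MASK) >> COUNTER_SHIFT);
}

/* Same operation as: and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF */
uint64_t node_value(uint64_t n)
{
    return n & VALUE_MASK;
}
```

Packing a node and reading the fields back returns the original type, counter (modulo 256), and value, which is what a compare-and-swap on the whole 64-bit word relies on to detect ABA.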
  • 40. RIGHT POP – TEST FOR EMPTY
        // empty if current.type($d5) == LN || current.type($d5) == DN
        cmp_neq_b1_u64 $c0, $d5, LN;
        cmp_neq_b1_u64 $c1, $d5, DN;
        and_b1 $c0, $c0, $c1;
        cbr $c0, @not_empty;
        // current node address: %deque($d0) + (%k($s0) - 1) * 16
        add_u32 $s1, $s0, -1;
        mul_u32 $s1, $s1, 16;
        cvt_u64_u32 $d3, $s1;
        add_u64 $d3, $d0, $d3;
        // re-read the node; if it has changed, do not report empty
        ld_global_acq_u64 $d4, [$d3];
        cmp_neq_b1_u64 $c0, $d4, $d1;
        cbr $c0, @not_empty;
        // deque empty so return EMPTY
        st_arg_u32 EMPTY, [%err];
        ret;
      @not_empty:
  • 41. RIGHT POP – TRY READ/REMOVE NODE
        // $d9 = (RN, next.cnt+1, 0)
        add_u64 $d8, $d8, 1;
        and_u64 $d8, $d8, 0xFF;
        shl_u64 $d8, $d8, 54;
        shl_u64 $d9, RN, 62;
        or_u64 $d9, $d8, $d9;
        // cas(deq+k, next, node(RN, next.cnt+1, 0)); $d3 is the address of node k-1
        add_u64 $d10, $d3, 16;
        atomic_cas_global_ar_u64 $d9, [$d10], $d2, $d9;
        cmp_neq_b1_u64 $c0, $d9, $d2;
        cbr $c0, @cas_failed;
        // $d9 = (RN, current.cnt+1, 0)
        add_u64 $d6, $d6, 1;
        and_u64 $d6, $d6, 0xFF;
        shl_u64 $d6, $d6, 54;
        shl_u64 $d9, RN, 62;
        or_u64 $d9, $d6, $d9;
        // cas(deq+(k-1), curr, node(RN, curr.cnt+1, 0))
        atomic_cas_global_ar_u64 $d9, [$d3], $d1, $d9;
        cmp_neq_b1_u64 $c0, $d9, $d1;
        cbr $c0, @cas_failed;
        st_arg_u32 SUCCESS, [%err];
        st_arg_u64 $d7, [%result];
        ret;
      @cas_failed:
        // loop back around and try again
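The right-pop steps above (ask the oracle, test for empty, then two compare-and-swaps) can be sketched in C11 atomics. This is a simplified illustration under several assumptions, not the deck's implementation: it uses a linear array with only LN and RN (no DN and no wrap-around), a linear scan standing in for `rcheck_oracle`, a hypothetical type encoding, and the default sequentially consistent atomics rather than the acquire/release forms used in the HSAIL. `deque_init` is a hypothetical helper added to set up an instance.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { VAL = 0, LN = 1, RN = 2 };                 /* hypothetical encoding */
#define VALUE_MASK UINT64_C(0x003FFFFFFFFFFFFF)

static uint64_t mk(unsigned type, unsigned counter, uint64_t value)
{
    return ((uint64_t)type << 62) | ((uint64_t)(counter & 0xFFu) << 54) |
           (value & VALUE_MASK);
}
static unsigned type_of(uint64_t n)    { return (unsigned)(n >> 62); }
static unsigned counter_of(uint64_t n) { return (unsigned)((n >> 54) & 0xFFu); }

struct deque { unsigned size; _Atomic uint64_t *array; };

/* Lay out cells as LN, values[0..n-1], then RN to the end of the array. */
void deque_init(struct deque *d, _Atomic uint64_t *cells, unsigned size,
                const uint64_t *values, unsigned n)
{
    assert(size >= n + 2);
    d->size = size;
    d->array = cells;
    atomic_init(&cells[0], mk(LN, 0, 0));
    for (unsigned i = 0; i < n; i++)
        atomic_init(&cells[i + 1], mk(VAL, 0, values[i]));
    for (unsigned i = n + 1; i < size; i++)
        atomic_init(&cells[i], mk(RN, 0, 0));
}

/* Stand-in oracle: linear scan for the leftmost RN (always index >= 1). */
static unsigned rcheck_oracle(struct deque *d)
{
    unsigned k = 1;
    while (k < d->size - 1 && type_of(atomic_load(&d->array[k])) != RN)
        k++;
    return k;
}

/* Returns true and stores the popped value, or false if the deque is empty. */
bool right_pop(struct deque *d, uint64_t *result)
{
    for (;;) {
        unsigned k = rcheck_oracle(d);
        uint64_t cur  = atomic_load(&d->array[k - 1]);
        uint64_t next = atomic_load(&d->array[k]);

        if (type_of(cur) == RN || type_of(next) != RN)
            continue;                       /* stale hint: ask the oracle again */
        if (type_of(cur) == LN) {
            if (atomic_load(&d->array[k - 1]) == cur)
                return false;               /* confirmed empty */
            continue;
        }
        /* First CAS bumps the version of the RN cell to the right. */
        uint64_t expected = next;
        if (!atomic_compare_exchange_strong(&d->array[k], &expected,
                                            mk(RN, counter_of(next) + 1, 0)))
            continue;
        /* Second CAS replaces the value cell with a fresh RN. */
        expected = cur;
        if (atomic_compare_exchange_strong(&d->array[k - 1], &expected,
                                           mk(RN, counter_of(cur) + 1, 0))) {
            *result = cur & VALUE_MASK;
            return true;
        }
    }
}
```

On a four-cell deque initialised with the values 7 and 9, successive calls pop 9, then 7, then report empty. The version counter is what lets the first CAS invalidate a competing operation that read the same RN cell earlier.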
  • 42. TAKE AWAYS: HSA provides a powerful and modern memory model: based on the well-known SC for DRF; defined as Release Consistency; extended with scopes as defined by HRF. OpenCL 2.0 introduces a new memory model: also based on SC for DRF; also defined in terms of Release Consistency; also extended with scopes as defined in HRF; has a well-defined mapping to HSA. Concurrent algorithm development for emerging heterogeneous computing clusters can benefit from the HSA and OpenCL 2.0 memory models.