SlideShare une entreprise Scribd logo
1  sur  42
HSA MEMORY MODEL
HOT CHIPS TUTORIAL - AUGUST 2013
BENEDICT R GASTER
WWW.QUALCOMM.COM
OUTLINE
 HSA Memory Model
 OpenCL 2.0
 Has a memory model too
 Obstruction-free bounded deques
 An example using the HSA memory model
HSA MEMORY MODEL
© Copyright 2012 HSA Foundation. All Rights Reserved. 3
TYPES OF MODELS
 Shared memory computers and programming languages, divide complexity
into models:
1. Memory model specifies safety
 e.g. can a work-item prevent others from progressing?
 This is what this section of the tutorial will focus on
2. Execution model specifies liveness
 Described in Ben Sander’s tutorial section on HSAIL
 e.g. can a work-item prevent others from progressing
3. Performance model specifies the big picture
 e.g. caches or branch divergence
 Specific to particular implementations and outside the scope of today’s tutorial
THE PROBLEM
 Assume all locations (a, b, …) are initialized to 0
 What are the values of $s2 and $s4 after execution?
© Copyright 2012 HSA Foundation. All Rights Reserved. 5
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
*a = 1;
int x = *b;
*b = 1;
int y = *a;
initially *a = 0 && *b = 0
THE SOLUTION
 The memory model tells us:
 Defines the visibility of writes to memory at any given point
 Provides us with a set of possible executions
© Copyright 2012 HSA Foundation. All Rights Reserved. 6
WHAT MAKES A GOOD MEMORY MODEL*
 Programmability ; A good model should make it (relatively) easy to write multi-
work-item programs. The model should be intuitive to most users, even to
those who have not read the details
 Performance ; A good model should facilitate high-performance
implementations at reasonable power, cost, etc. It should give implementers
broad latitude in options
 Portability ; A good model would be adopted widely or at least provide
backward compatibility or the ability to translate among models
* S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD
thesis, Computer Sciences Department, University of Wisconsin–Madison, Nov. 1993.
© Copyright 2012 HSA Foundation. All Rights Reserved. 7
SEQUENTIAL CONSISTENCY (SC)*
 Axiomatic Definition
 A single processor (core) sequential if ―the result of an execution is the same as if
the operations had been executed in the order specified by the program.‖
 A multiprocessor sequentially consistent if ―the result of any execution is the same
as if the operations of all processors (cores) were executed in some sequential
order, and the operations of each individual processor (core) appear in this
sequence in the order specified by its program.‖
© Copyright 2012 HSA Foundation. All Rights Reserved. 8
 But HW/Compiler actually implements more relaxed models, e.g. ARMv7
* L. Lamport. How to Make a Multiprocessor Computer that Correctly
Executes Multiprocessor Programs. IEEE Transactions on
Computers, C-28(9):690–91, Sept. 1979.
SEQUENTIAL CONSISTENCY (SC)
© Copyright 2012 HSA Foundation. All Rights Reserved. 9
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
$s2 = 0 && $s4 = 1
BUT WHAT ABOUT ACTUAL HARDWARE
 Sequential consistency is (reasonably) easy to understand, but limits
optimizations that the compiler and hardware can perform
 Many modern processors implement many reordering optimizations
 Store buffers (TSO*), work-items can see their own stores early
 Reorder buffers (XC*), work-items can see other work-items store early
© Copyright 2012 HSA Foundation. All Rights Reserved. 10
*TSO – Total Store Order as implemented by Sparc and x86
*XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
RELAXED CONSISTENCY (XC)
© Copyright 2012 HSA Foundation. All Rights Reserved. 11
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32 $s1, [&a] ;
ld_global_u32 $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32 $s3, [&b] ;
ld_global_u32 $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
ld_global_u32 $s2, [&b] ;
ld_global_u32 $s4, [&a] ;
st_global_u32 $s1, [&a] ;
st_global_u32 $s3, [&b] ;
$s2 = 0 && $s4 = 0
WHAT ARE OUR 3 Ps?
 Programmability ; XC is really pretty hard for the programmer to reason about
what will be visible when
 many memory model experts have been known to get it wrong!
 Performance ; XC is good for performance, the hardware (compiler) is free to
reorder many loads and stores, opening the door for performance and power
enhancements
 Portability ; XC is very portable as it places very little constraints
© Copyright 2012 HSA Foundation. All Rights Reserved. 12
MY CHILDREN AND COMPUTER
ARCHITECTS ALL WANT
 To have their cake and eat it!
© Copyright 2012 HSA Foundation. All Rights Reserved. 13
Put picture with kids and cake
HSA Provides: The ability to enable
programmers to reason with (relatively)
intuitive model of SC, while still achieving the
benefits of XC!
SEQUENTIAL CONSISTENCY FOR DRF*
 HSA, following Java, C++11, and OpenCL 2.0 adopts SC for Data Race Free
(DRF)
 plus some new capabilities !
 (Informally) A data race occurs when two (or more) work-items access the
same memory location such that:
 At least one of the accesses is a WRITE
 There are no intervening synchronization operations
 SC for DRF asks:
 Programmers to ensure programs are DRF under SC
 Implementers to ensure that all executions of DRF programs on the relaxed model
are also SC executions
© Copyright 2012 HSA Foundation. All Rights Reserved. 14
*S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the
17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990
HSA SUPPORTS RELEASE CONSISTENCY
 HSA’s memory model is based on RCSC:
 All ld_acq and st_rel are SC
 Means coherence on all ld_acq and st_rel to a single address.
 All ld_acq and st_rel are program ordered per work-item (actually: sequence-
order by language constraints)
 Similar model adopted by ARMv8
 HSA extends RCSC to SC for HRF*, to access the full capabilities of modern
heterogeneous systems, containing CPUs, GPUs, and DSPs, for example.
© Copyright 2012 HSA Foundation. All Rights Reserved. 15
*Sequential Consistency for Heterogeneous-Race-Free Programmer-centric
Memory Models for Heterogeneous Platforms. D. R. Hower, Beckman, B. R.
Gaster, B. Hechtman, M D. Hill, S. K. Reinhart, and D. Wood. MSPC’13.
MAKING RELAXED CONSISTENCY WORK
© Copyright 2012 HSA Foundation. All Rights Reserved. 16
Work-item 0
mov_u32 $s1, 1 ;
st_global_u32_rel $s1, [&a] ;
ld_global_u32_acq $s2, [&b] ;
Work-item 1
mov_u32 $s3, 1 ;
st_global_u32_rel $s3, [&b] ;
ld_global_u32_acq $s4, [&a] ;
mov_u32 $s1, 1 ;
mov_u32 $s3, 1;
st_global_u32_rel $s1, [&a] ;
ld_global_u32_acq $s2, [&b] ;
st_global_u32_rel $s3, [&b] ;
ld_global_u32_acq $s4, [&a] ;
$s2 = 0 && $s4 = 1
SEQUENTIAL CONSISTENCY FOR DRF
 Two memory accesses participate in a data race if they
 access the same location
 at least one access is a store
 can occur simultaneously
 i.e. appear as adjacent operations in interleaving.
 A program is data-race-free if no possible execution
results in a data race.
 Sequential consistency for data-race-free programs
 Avoid everything else
HSA: Not good enough!
ALL ARE NOT EQUAL – OR SOME CAN
SEE BETTER THAN OTHERS
 Remember the HSAIL Execution Model
© Copyright 2012 HSA Foundation. All Rights Reserved. 18
device scope
group scope
wave scope
platform scope
DATA-RACE-FREE IS NOT ENOUGH
t1 t2 t3 t4
st_global 1, [&X]
st_global_rel 0, [&flag]
atomic_cas_global_ar 1, 0, [&flag]
...
st_global_rel 0, [&flag]
atomic_cas_global_ar ,1 0, [&flag]
ld (??), [&x]
group #1-2 group #3-4
 Two ordinary memory accesses participate in a data race if they
 Access same location
 At least one is a store
 Can occur simultaneously
Not a data race…
Is it SC?
Well that depends
t4t3t1 t2
SGlobal
S12 S34
visibility implied by
causality?
SEQUENTIAL CONSISTENCY FOR
HETEROGNEOUS-RACE-FREE
 Two memory accesses participate in a heterogeneous race if
 access the same location
 at least one access is a store
 can occur simultaneously
 i.e. appear as adjacent operations in interleaving.
 Are not synchronized with ―enough‖ scope
 A program is heterogeneous-race-free if no possible
execution results in a heterogeneous race.
 Sequential consistency for heterogeneous-race-free
programs
 Avoid everything else
HSA HETEROGENEOUS RACE FREE
 HRF0: Basic Scope Synchronization
 ―enough‖ = both threads synchronize using identical scope
 Recall example:
 Contains a heterogeneous race in HSA
t1 t2 t3 t4
st_global 1, [&X]
st_global_rel_wg 0, [&flag]
...
atomic_cas_global_ar_wg,1 0, [&flag]
ld (??), [&x]
Workgroup #1-2 Workgroup #3-4
HSA Conclusion:
This is bad. Don’t do it.
HOW TO USE HSA WITH SCOPES
Use smallest scope that includes all
producers/consumers of shared data
HSA Scope Selection Guideline
Implication:
Producers/consumers must be known at synchronization time
 Want: For performance, use smallest scope possible
 What is safe in HSA?
Is this a valid assumption?
REGULAR GPGPU WORKLOADS
N
M
Define
Problem Space
Partition
Hierarchically
Communicate
Locally
N times
Communicate
Globally
M times
Well defined (regular) data partitioning +
Well defined (regular) synchronization pattern =
 Producer/consumers are always known
Generally: HSA works well with
regular data-parallel workloads
t1 t2 t3 t4
st_global 1, [&X]
st_global_rel_plat 0, [&flag]
atomic_cas_global_ar_plat 1, 0, [&flag]
...
st_global_rel_plat 0, [&flag]
atomic_cas_global_ar_plat ,1 0, [&flag]
ld $s1, [&x]
IRREGULAR WORKLOADS
 HSA: example is race
 Must upgrade wg (workgroup) -> plat (platform)
 HSA memory model says:
 ld $s1, [&x], will see value (1)!
Workgroup #1-2 Workgroup #3-4
OPENCL 2.0
HAS A MEMORY MODEL TOO
MAPPING ONTO HSA’S MEMORY MODEL
OPENCL 2.0 BACKGROUND
 Provisional specification released at SIGGRAPH’13, July 2013.
 Huge update to OpenCL to account for the evolving hardware landscape and
emerging use cases (e.g. irregular work loads)
 Key features:
 Shared virtual memory, including platform atomics
 Formally defined memory model based on C11 plus support for scopes
 Includes an extended set of C1X atomic operations
 Generic address space, that subsumes global, local, and private
 Device to device enqueue
 Out-of-order device side queuing model
 Backwards compatible with OpenCL 1.x
 It is straightforward to provide a mapping from OpenCL 1.x to the proposed
model
 OpenCL 1.x atomics are unordered and so map to atomic_op_X
 Mapping for fences not shown but straightforward
OPENCL 1.X MEMORY MODEL MAPPING
OpenCL Operation HSA Memory Model
Operation
Load ld_global_wg
ld_group_wg
Store st_global_wg
st_group_wg
atomic_op atomic_op_global_comp
atomic_op_group_wg
barrier(…) sync_wg
OPENCL 2.0 MEMORY MODEL MAPPING
OpenCL Operation HSA Memory Model Operation
Load
memory_order_relaxed
atomic_ld_[global | group]_scope
Store
Memory_order_relaxed
atomic_st_[global | group]_scope
Load
memory_order_acquire
ld_[global | group]_acq_scope
Load
memory_order_seq_cst
fence_rel_scope ; ld_[global | group]_acq_scope
Store
memory_order_release
st_[global | group]_rel_scope
Store
Memory_order_seq_cst
st_[global | group]_rel_scope ; fence_acq_scope
memory_order_acq_rel atomic_op_[global | group]_acq_rel_scope
memory_order_seq_cst atomic_op_[global|group]_acq_rel_scope
OPENCL 2.0 MEMORY SCOPE MAPPING
OpenCL Scope HSA Scope
memory_scope_work_group _wg
memory_scope_device _component
memory_scope_all_svm_devices _platform
© Copyright 2012 HSA Foundation. All Rights Reserved. 29
OBSTRUCTION-FREE
BOUNDED DEQUES
AN EXAMPLE USING THE HSA MEMORY MODEL
CONCURRENT DATA-STRUCTURES
 Why do we need such a memory model in practice?
 One important application of memory consistency is in the development and
use of concurrent data-structures
 In particular, there are a class data-structures implementations that provide
non-blocking guarantees:
 wait-free; An algorithm is wait-free if every operation has a bound on the number of
steps the algorithm will take before the operation completes
 In practice very hard to build efficient data-structures that meet this requirement
 lock-free; An algorithm is lock-free if every if, given enough time, at least one thread
of the work-items (or threads) makes progress
 In practice lock-free algorithms are implemented by work-item cooperating with
one enough to allow progress
 Obstruction-free; An algorithm is obstruction-free if a work-item, running in
isolation, can make progress
Emerging Compute Cluster
BUT WAY NOT JUST USE MUTUAL
EXCLUSION?
© Copyright 2012 HSA Foundation. All Rights Reserved. 32
Fabric & Memory Controller
Krait
CPUAdreno
GPU
Krait
CPU
Krait
CPU
Krait
CPU
MMU
MMUs
2MB L2
Hexagon
DSP
MMU
?? ??
Diversity in a heterogeneous system, such as
different clock speeds, different scheduling
policies, and more can mean traditional mutual
exclusion is not the right choice
CONCURRENT DATA-STRUCTURES
 Emerging heterogeneous compute clusters means we need:
 To adapt existing concurrent data-structures
 Developer new concurrent data-structures
 Lock based programming may still be useful but often these algorithms will
need to be lock-free
 Of course, this is a key application of the HSA memory model
 To showcase this we highlight the development of a well known (HLM)
obstruction-free deque*
© Copyright 2012 HSA Foundation. All Rights Reserved. 33
*Herlihy, M. et al. 2003. Obstruction-free
synchronization: double-ended queues as an example.
(2003), 522–529.
HLM - OBSTRUCTION-FREE DEQUE
 Uses a fixed length circular queue
 At any given time, reading from left to right, the array will contain:
 Zero or more left-null (LN) values
 Zero or more dummy-null (DN) values
 Zero or more right-null (RN) values
 At all times there must be:
 At least two different nulls values
 At least one LN or DN, and at least one DN or RN
 Memory consistency is required to allow multiple producers and multiple
consumers, potentially happening in parallel from the left and right ends, to see
changes from other work-items (HSA Components) and threads (HSA Agents)
© Copyright 2012 HSA Foundation. All Rights Reserved. 34
HLM - OBSTRUCTION-FREE DEQUE
© Copyright 2012 HSA Foundation. All Rights Reserved. 35
LNLN vLN RNv RNRN
left right
Key:
LN – left null value
RN – right null value
v – value
left – left hint index
right – right hint index
C REPRESENTATION OF DEQUE
struct node {
uint64_t type : 2; // null type (LN, RN, DN)
uint64_t counter : 8 ; // version counter to avoid ABA
uint64_t value : 54 ; // index value stored in queue
}
struct queue {
unsigned int size; // size of bounded buffer
node * array; // backing store for deque itself
}
© Copyright 2012 HSA Foundation. All Rights Reserved. 36
HSAIL REPRESENTATION
 Allocate a deque in global memory using HSAIL
@deque_instance:
align 64 global_u32 &size;
align 8 global_u64 &array;
© Copyright 2012 HSA Foundation. All Rights Reserved. 37
ORACLE
 Assume a function:
function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);
 Which given a deque
 returns (%k) the position of the left most of RN
 ld_global_acq used to read node from array
 Makes one if necessary (i.e. if there are only LN or DN)
 atomic_cas_global_ar, required to make new RN
 returns (%left) the left node (i.e. the value to the left of the left most RN position)
 returns (%right) the right node (i.e. the value at position (%k))
© Copyright 2012 HSA Foundation. All Rights Reserved. 38
RIGHT POP
function &right_pop(arg_u32err, arg_u64 %result) (arg_u64 %deque) {
// load queue address
ld_arg_u64 $d0, [%deque];
@loop_forever:
// setup and call right oracle to get next RN
arg_u32 %k; arg_u64 %current; arg_u64 %next;
call &rcheck_oracle(%queue) ;
ld_arg_u32 $s0, [%k]; ld_arg_u64 $d1, [%current]; ld_arg_u64 $d2, [%next];
// current.value($d5)
shr_u64 $d5, $d1, 62;
// current.counter($d6)
and_u64 $d6, $d1,
0x3FC0000000000000;
shr_u64 $d6, $d6, 54;
// current.value($d7)
and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF;
// next.counter($d8)
and_u64 $d8, $d2, 0x3FC0000000000000; shr_u64 $d8, $d8, 54; brn @loop_forever ;
}
© Copyright 2012 HSA Foundation. All Rights Reserved. 39
RIGHT POP – TEST FOR EMPTY
// current.type($d5) == LN || current.type($d5) == DN
cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN;
or_b1 $c0, $c0, $c1;
cbr $c0, @not_empty ;
// current node index (%deque($d0) + (%k($s1) - 1) * 16)
add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 16; add_u32 $d3, $d0, $s0;
ld_global_acq_u64 $d4, [$d3];
cmp_neq_b1_u64 $c0, $d4, $d1;
cbr $c0, @not_empty;
st_arg_u32 EMPTY, [&err]; // deque empty so return EMPTY
%ret
@not_empty:
© Copyright 2012 HSA Foundation. All Rights Reserved. 40
RIGHT POP – TRY READ/REMOVE NODE
// $d9 = (RN, next.cnt+1, 0)
add_u64 $d8, $d8, 1;
shl_u64 $d9, RN, 62;
and_u64 $d8, $d8, $d9;
// cas(deq+k, next, node(RN, next.cnt+1, 0))
atomic_cas_global_ar_u64 $d9, [$s0], $d2, $d9;
cmp_neq_u64 $c0, $d9, $d2;
cbr $c0, @cas_failed;
// $d9 = (RN, current.cnt+1, 0)
add_u64 $d6, $d6, 1;
shl_u64 $d9, RN, 62;
and_u64 $d9, $d6, $d9;
// cas(deq+(k-1), curr, node(RN, curr.cnt+1,0)
atomic_cas_global_ar_u64 $d9, [$s1], $d1, $d9;
cmp_neq_u64 $c0, $d9, $d1;
cbr $c0, @cas_failed;
st_arg_u32 SUCCESS, [&err];
st_arg_u64 $d7, [&value];
%ret
@cas_failed:
// loop back around and try again
© Copyright 2012 HSA Foundation. All Rights Reserved. 41
TAKE AWAYS
 HSA provides a powerful and modern memory model
 Based on the well know SC for DRF
 Defined as Release Consistency
 Extended with scopes as defined by HRF
 OpenCL 2.0 introduces a new memory model
 Also based on SC for DRF
 Also defined in terms of Release Consistency
 Also Extended with scope as defined in HRF
 Has a well defined mapping to HSA
 Concurrent algorithm development for emerging heterogeneous computing
cluster can benefit from HSA and OpenCL 2.0 memory models
© Copyright 2012 HSA Foundation. All Rights Reserved. 42

Contenu connexe

Tendances

ISCA Final Presentation - HSAIL
ISCA Final Presentation - HSAILISCA Final Presentation - HSAIL
ISCA Final Presentation - HSAIL
HSA Foundation
 
ISCA Final Presentation - Applications
ISCA Final Presentation - ApplicationsISCA Final Presentation - Applications
ISCA Final Presentation - Applications
HSA Foundation
 
ISCA final presentation - Queuing Model
ISCA final presentation - Queuing ModelISCA final presentation - Queuing Model
ISCA final presentation - Queuing Model
HSA Foundation
 
ISCA final presentation - Runtime
ISCA final presentation - RuntimeISCA final presentation - Runtime
ISCA final presentation - Runtime
HSA Foundation
 

Tendances (20)

HSA Queuing Hot Chips 2013
HSA Queuing Hot Chips 2013 HSA Queuing Hot Chips 2013
HSA Queuing Hot Chips 2013
 
ISCA Final Presentation - HSAIL
ISCA Final Presentation - HSAILISCA Final Presentation - HSAIL
ISCA Final Presentation - HSAIL
 
AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”
 AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.” AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”
AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”
 
HSA Overview
HSA Overview HSA Overview
HSA Overview
 
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
HC-4015, An Overview of the HSA System Architecture Requirements, by Paul Bli...
 
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
 
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
 
ISCA Final Presentation - Applications
ISCA Final Presentation - ApplicationsISCA Final Presentation - Applications
ISCA Final Presentation - Applications
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben Gaster
 
HSA Introduction Hot Chips 2013
HSA Introduction  Hot Chips 2013HSA Introduction  Hot Chips 2013
HSA Introduction Hot Chips 2013
 
ISCA final presentation - Queuing Model
ISCA final presentation - Queuing ModelISCA final presentation - Queuing Model
ISCA final presentation - Queuing Model
 
ISCA final presentation - Runtime
ISCA final presentation - RuntimeISCA final presentation - Runtime
ISCA final presentation - Runtime
 
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
 
HSA Features
HSA FeaturesHSA Features
HSA Features
 
C11/C++11 Memory model. What is it, and why?
C11/C++11 Memory model. What is it, and why?C11/C++11 Memory model. What is it, and why?
C11/C++11 Memory model. What is it, and why?
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
 
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
 
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
 
CC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-Pousty
CC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-PoustyCC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-Pousty
CC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-Pousty
 

Similaire à HSA Memory Model Hot Chips 2013

Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
prabakaranbrick
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
Bhupesh Bansal
 
Big data processing using HPCC Systems Above and Beyond Hadoop
Big data processing using HPCC Systems Above and Beyond HadoopBig data processing using HPCC Systems Above and Beyond Hadoop
Big data processing using HPCC Systems Above and Beyond Hadoop
HPCC Systems
 
quickguide-einnovator-5-springxd
quickguide-einnovator-5-springxdquickguide-einnovator-5-springxd
quickguide-einnovator-5-springxd
jorgesimao71
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
Gianmario Spacagna
 

Similaire à HSA Memory Model Hot Chips 2013 (20)

hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
lecture3.pdf
lecture3.pdflecture3.pdf
lecture3.pdf
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
At the Crossroads of HPC and Cloud Computing with Openstack
At the Crossroads of HPC and Cloud Computing with OpenstackAt the Crossroads of HPC and Cloud Computing with Openstack
At the Crossroads of HPC and Cloud Computing with Openstack
 
Spark cluster computing with working sets
Spark cluster computing with working setsSpark cluster computing with working sets
Spark cluster computing with working sets
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Big data processing using HPCC Systems Above and Beyond Hadoop
Big data processing using HPCC Systems Above and Beyond HadoopBig data processing using HPCC Systems Above and Beyond Hadoop
Big data processing using HPCC Systems Above and Beyond Hadoop
 
Apache Mesos Overview and Integration
Apache Mesos Overview and IntegrationApache Mesos Overview and Integration
Apache Mesos Overview and Integration
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
 
Hadoop on OpenStack
Hadoop on OpenStackHadoop on OpenStack
Hadoop on OpenStack
 
quickguide-einnovator-5-springxd
quickguide-einnovator-5-springxdquickguide-einnovator-5-springxd
quickguide-einnovator-5-springxd
 
Distributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using OpenshmemDistributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using Openshmem
 
Distributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using OpenshmemDistributed Shared Memory – A Survey and Implementation Using Openshmem
Distributed Shared Memory – A Survey and Implementation Using Openshmem
 
Survive JavaScript - Strategies and Tricks
Survive JavaScript - Strategies and TricksSurvive JavaScript - Strategies and Tricks
Survive JavaScript - Strategies and Tricks
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
 
Running Distributed TensorFlow with GPUs on Mesos with DC/OS
Running Distributed TensorFlow with GPUs on Mesos with DC/OS Running Distributed TensorFlow with GPUs on Mesos with DC/OS
Running Distributed TensorFlow with GPUs on Mesos with DC/OS
 
Spark,Hadoop,Presto Comparition
Spark,Hadoop,Presto ComparitionSpark,Hadoop,Presto Comparition
Spark,Hadoop,Presto Comparition
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 

Plus de HSA Foundation

ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
HSA Foundation
 
ISCA Final Presentation - Intro
ISCA Final Presentation - IntroISCA Final Presentation - Intro
ISCA Final Presentation - Intro
HSA Foundation
 
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
HSA Foundation
 

Plus de HSA Foundation (14)

KeynoteTHE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU
KeynoteTHE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPUKeynoteTHE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU
KeynoteTHE HETEROGENEOUS SYSTEM ARCHITECTURE ITS (NOT) ALL ABOUT THE GPU
 
Hsa Runtime version 1.00 Provisional
Hsa Runtime version  1.00  ProvisionalHsa Runtime version  1.00  Provisional
Hsa Runtime version 1.00 Provisional
 
Hsa programmers reference manual (version 1.0 provisional)
Hsa programmers reference manual (version 1.0 provisional)Hsa programmers reference manual (version 1.0 provisional)
Hsa programmers reference manual (version 1.0 provisional)
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
 
ISCA Final Presentation - Intro
ISCA Final Presentation - IntroISCA Final Presentation - Intro
ISCA Final Presentation - Intro
 
Hsa Platform System Architecture Specification Provisional verl 1.0 ratifed
Hsa Platform System Architecture Specification Provisional  verl 1.0 ratifed Hsa Platform System Architecture Specification Provisional  verl 1.0 ratifed
Hsa Platform System Architecture Specification Provisional verl 1.0 ratifed
 
Apu13 cp lu-keynote-final-slideshare
Apu13 cp lu-keynote-final-slideshareApu13 cp lu-keynote-final-slideshare
Apu13 cp lu-keynote-final-slideshare
 
HSA Foundation BoF -Siggraph 2013 Flyer
HSA Foundation BoF -Siggraph 2013 Flyer HSA Foundation BoF -Siggraph 2013 Flyer
HSA Foundation BoF -Siggraph 2013 Flyer
 
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...
 
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...
 
Phil Rogers IFA Keynote 2012
Phil Rogers IFA Keynote 2012Phil Rogers IFA Keynote 2012
Phil Rogers IFA Keynote 2012
 
Hsa2012 logo guidelines.
Hsa2012 logo guidelines.Hsa2012 logo guidelines.
Hsa2012 logo guidelines.
 
What Fabric Engine Can Do With HSA
What Fabric Engine Can Do With HSAWhat Fabric Engine Can Do With HSA
What Fabric Engine Can Do With HSA
 
Fabric Engine: Why HSA is Invaluable
Fabric Engine: Why HSA is  InvaluableFabric Engine: Why HSA is  Invaluable
Fabric Engine: Why HSA is Invaluable
 

Dernier

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Dernier (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

HSA Memory Model Hot Chips 2013

  • 1. HSA MEMORY MODEL HOT CHIPS TUTORIAL - AUGUST 2013 BENEDICT R GASTER WWW.QUALCOMM.COM
  • 2. OUTLINE  HSA Memory Model  OpenCL 2.0  Has a memory model too  Obstruction-free bounded deques  An example using the HSA memory model
  • 3. HSA MEMORY MODEL © Copyright 2012 HSA Foundation. All Rights Reserved. 3
  • 4. TYPES OF MODELS  Shared memory computers and programming languages, divide complexity into models: 1. Memory model specifies safety  e.g. can a work-item prevent others from progressing?  This is what this section of the tutorial will focus on 2. Execution model specifies liveness  Described in Ben Sander’s tutorial section on HSAIL  e.g. can a work-item prevent others from progressing 3. Performance model specifies the big picture  e.g. caches or branch divergence  Specific to particular implementations and outside the scope of today’s tutorial
  • 5. THE PROBLEM  Assume all locations (a, b, …) are initialized to 0  What are the values of $s2 and $s4 after execution? © Copyright 2012 HSA Foundation. All Rights Reserved. 5 Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; *a = 1; int x = *b; *b = 1; int y = *a; initially *a = 0 && *b = 0
  • 6. THE SOLUTION  The memory model tells us:  Defines the visibility of writes to memory at any given point  Provides us with a set of possible executions © Copyright 2012 HSA Foundation. All Rights Reserved. 6
  • 7. WHAT MAKES A GOOD MEMORY MODEL*  Programmability ; A good model should make it (relatively) easy to write multi- work-item programs. The model should be intuitive to most users, even to those who have not read the details  Performance ; A good model should facilitate high-performance implementations at reasonable power, cost, etc. It should give implementers broad latitude in options  Portability ; A good model would be adopted widely or at least provide backward compatibility or the ability to translate among models * S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department, University of Wisconsin–Madison, Nov. 1993. © Copyright 2012 HSA Foundation. All Rights Reserved. 7
  • 8. SEQUENTIAL CONSISTENCY (SC)*  Axiomatic Definition  A single processor (core) sequential if ―the result of an execution is the same as if the operations had been executed in the order specified by the program.‖  A multiprocessor sequentially consistent if ―the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.‖ © Copyright 2012 HSA Foundation. All Rights Reserved. 8  But HW/Compiler actually implements more relaxed models, e.g. ARMv7 * L. Lamport. How to Make a Multiprocessor Computer that Correctly Executes Multiprocessor Programs. IEEE Transactions on Computers, C-28(9):690–91, Sept. 1979.
  • 9. SEQUENTIAL CONSISTENCY (SC) © Copyright 2012 HSA Foundation. All Rights Reserved. 9 Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; $s2 = 0 && $s4 = 1
  • 10. BUT WHAT ABOUT ACTUAL HARDWARE  Sequential consistency is (reasonably) easy to understand, but limits optimizations that the compiler and hardware can perform  Many modern processors implement many reordering optimizations  Store buffers (TSO*), work-items can see their own stores early  Reorder buffers (XC*), work-items can see other work-items store early © Copyright 2012 HSA Foundation. All Rights Reserved. 10 *TSO – Total Store Order as implemented by Sparc and x86 *XC – Relaxed Consistency model, e.g. ARMv7, Power7, and Adreno
  • 11. RELAXED CONSISTENCY (XC) © Copyright 2012 HSA Foundation. All Rights Reserved. 11 Work-item 0 mov_u32 $s1, 1 ; st_global_u32 $s1, [&a] ; ld_global_u32 $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32 $s3, [&b] ; ld_global_u32 $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; ld_global_u32 $s2, [&b] ; ld_global_u32 $s4, [&a] ; st_global_u32 $s1, [&a] ; st_global_u32 $s3, [&b] ; $s2 = 0 && $s4 = 0
  • 12. WHAT ARE OUR 3 Ps?  Programmability ; XC is really pretty hard for the programmer to reason about what will be visible when  many memory model experts have been known to get it wrong!  Performance ; XC is good for performance, the hardware (compiler) is free to reorder many loads and stores, opening the door for performance and power enhancements  Portability ; XC is very portable as it places very little constraints © Copyright 2012 HSA Foundation. All Rights Reserved. 12
  • 13. MY CHILDREN AND COMPUTER ARCHITECTS ALL WANT  To have their cake and eat it! © Copyright 2012 HSA Foundation. All Rights Reserved. 13 Put picture with kids and cake HSA Provides: The ability to enable programmers to reason with (relatively) intuitive model of SC, while still achieving the benefits of XC!
  • 14. SEQUENTIAL CONSISTENCY FOR DRF*  HSA, following Java, C++11, and OpenCL 2.0 adopts SC for Data Race Free (DRF)  plus some new capabilities !  (Informally) A data race occurs when two (or more) work-items access the same memory location such that:  At least one of the accesses is a WRITE  There are no intervening synchronization operations  SC for DRF asks:  Programmers to ensure programs are DRF under SC  Implementers to ensure that all executions of DRF programs on the relaxed model are also SC executions © Copyright 2012 HSA Foundation. All Rights Reserved. 14 *S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990
  • 15. HSA SUPPORTS RELEASE CONSISTENCY  HSA’s memory model is based on RCSC:  All ld_acq and st_rel are SC  Means coherence on all ld_acq and st_rel to a single address.  All ld_acq and st_rel are program ordered per work-item (actually: sequence- order by language constraints)  Similar model adopted by ARMv8  HSA extends RCSC to SC for HRF*, to access the full capabilities of modern heterogeneous systems, containing CPUs, GPUs, and DSPs, for example. © Copyright 2012 HSA Foundation. All Rights Reserved. 15 *Sequential Consistency for Heterogeneous-Race-Free Programmer-centric Memory Models for Heterogeneous Platforms. D. R. Hower, Beckman, B. R. Gaster, B. Hechtman, M D. Hill, S. K. Reinhart, and D. Wood. MSPC’13.
  • 16. MAKING RELAXED CONSISTENCY WORK © Copyright 2012 HSA Foundation. All Rights Reserved. 16 Work-item 0 mov_u32 $s1, 1 ; st_global_u32_rel $s1, [&a] ; ld_global_u32_acq $s2, [&b] ; Work-item 1 mov_u32 $s3, 1 ; st_global_u32_rel $s3, [&b] ; ld_global_u32_acq $s4, [&a] ; mov_u32 $s1, 1 ; mov_u32 $s3, 1; st_global_u32_rel $s1, [&a] ; ld_global_u32_acq $s2, [&b] ; st_global_u32_rel $s3, [&b] ; ld_global_u32_acq $s4, [&a] ; $s2 = 0 && $s4 = 1
  • 17. SEQUENTIAL CONSISTENCY FOR DRF  Two memory accesses participate in a data race if they  access the same location  at least one access is a store  can occur simultaneously  i.e. appear as adjacent operations in interleaving.  A program is data-race-free if no possible execution results in a data race.  Sequential consistency for data-race-free programs  Avoid everything else HSA: Not good enough!
  • 18. ALL ARE NOT EQUAL – OR SOME CAN SEE BETTER THAN OTHERS  Remember the HSAIL Execution Model © Copyright 2012 HSA Foundation. All Rights Reserved. 18 device scope group scope wave scope platform scope
  • 19. DATA-RACE-FREE IS NOT ENOUGH t1 t2 t3 t4 st_global 1, [&X] st_global_rel 0, [&flag] atomic_cas_global_ar 1, 0, [&flag] ... st_global_rel 0, [&flag] atomic_cas_global_ar ,1 0, [&flag] ld (??), [&x] group #1-2 group #3-4  Two ordinary memory accesses participate in a data race if they  Access same location  At least one is a store  Can occur simultaneously Not a data race… Is it SC? Well that depends t4t3t1 t2 SGlobal S12 S34 visibility implied by causality?
  • 20. SEQUENTIAL CONSISTENCY FOR HETEROGNEOUS-RACE-FREE  Two memory accesses participate in a heterogeneous race if  access the same location  at least one access is a store  can occur simultaneously  i.e. appear as adjacent operations in interleaving.  Are not synchronized with ―enough‖ scope  A program is heterogeneous-race-free if no possible execution results in a heterogeneous race.  Sequential consistency for heterogeneous-race-free programs  Avoid everything else
  • 21. HSA HETEROGENEOUS RACE FREE  HRF0: Basic Scope Synchronization  ―enough‖ = both threads synchronize using identical scope  Recall example:  Contains a heterogeneous race in HSA t1 t2 t3 t4 st_global 1, [&X] st_global_rel_wg 0, [&flag] ... atomic_cas_global_ar_wg,1 0, [&flag] ld (??), [&x] Workgroup #1-2 Workgroup #3-4 HSA Conclusion: This is bad. Don’t do it.
  • 22. HOW TO USE HSA WITH SCOPES Use smallest scope that includes all producers/consumers of shared data HSA Scope Selection Guideline Implication: Producers/consumers must be known at synchronization time  Want: For performance, use smallest scope possible  What is safe in HSA? Is this a valid assumption?
  • 23. REGULAR GPGPU WORKLOADS N M Define Problem Space Partition Hierarchically Communicate Locally N times Communicate Globally M times Well defined (regular) data partitioning + Well defined (regular) synchronization pattern =  Producer/consumers are always known Generally: HSA works well with regular data-parallel workloads
  • 24. t1 t2 t3 t4 st_global 1, [&X] st_global_rel_plat 0, [&flag] atomic_cas_global_ar_plat 1, 0, [&flag] ... st_global_rel_plat 0, [&flag] atomic_cas_global_ar_plat ,1 0, [&flag] ld $s1, [&x] IRREGULAR WORKLOADS  HSA: example is race  Must upgrade wg (workgroup) -> plat (platform)  HSA memory model says:  ld $s1, [&x], will see value (1)! Workgroup #1-2 Workgroup #3-4
  • 25. OPENCL 2.0 HAS A MEMORY MODEL TOO MAPPING ONTO HSA’S MEMORY MODEL
  • 26. OPENCL 2.0 BACKGROUND  Provisional specification released at SIGGRAPH’13, July 2013.  Huge update to OpenCL to account for the evolving hardware landscape and emerging use cases (e.g. irregular work loads)  Key features:  Shared virtual memory, including platform atomics  Formally defined memory model based on C11 plus support for scopes  Includes an extended set of C1X atomic operations  Generic address space, that subsumes global, local, and private  Device to device enqueue  Out-of-order device side queuing model  Backwards compatible with OpenCL 1.x
  • 27.  It is straightforward to provide a mapping from OpenCL 1.x to the proposed model  OpenCL 1.x atomics are unordered and so map to atomic_op_X  Mapping for fences not shown but straightforward OPENCL 1.X MEMORY MODEL MAPPING OpenCL Operation HSA Memory Model Operation Load ld_global_wg ld_group_wg Store st_global_wg st_group_wg atomic_op atomic_op_global_comp atomic_op_group_wg barrier(…) sync_wg
  • 28. OPENCL 2.0 MEMORY MODEL MAPPING OpenCL Operation HSA Memory Model Operation Load memory_order_relaxed atomic_ld_[global | group]_scope Store Memory_order_relaxed atomic_st_[global | group]_scope Load memory_order_acquire ld_[global | group]_acq_scope Load memory_order_seq_cst fence_rel_scope ; ld_[global | group]_acq_scope Store memory_order_release st_[global | group]_rel_scope Store Memory_order_seq_cst st_[global | group]_rel_scope ; fence_acq_scope memory_order_acq_rel atomic_op_[global | group]_acq_rel_scope memory_order_seq_cst atomic_op_[global|group]_acq_rel_scope
  • 29. OPENCL 2.0 MEMORY SCOPE MAPPING OpenCL Scope HSA Scope memory_scope_work_group _wg memory_scope_device _component memory_scope_all_svm_devices _platform © Copyright 2012 HSA Foundation. All Rights Reserved. 29
  • 30. OBSTRUCTION-FREE BOUNDED DEQUES AN EXAMPLE USING THE HSA MEMORY MODEL
  • 31. CONCURRENT DATA-STRUCTURES  Why do we need such a memory model in practice?  One important application of memory consistency is in the development and use of concurrent data-structures  In particular, there are a class data-structures implementations that provide non-blocking guarantees:  wait-free; An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes  In practice very hard to build efficient data-structures that meet this requirement  lock-free; An algorithm is lock-free if every if, given enough time, at least one thread of the work-items (or threads) makes progress  In practice lock-free algorithms are implemented by work-item cooperating with one enough to allow progress  Obstruction-free; An algorithm is obstruction-free if a work-item, running in isolation, can make progress
  • 32. Emerging Compute Cluster BUT WAY NOT JUST USE MUTUAL EXCLUSION? © Copyright 2012 HSA Foundation. All Rights Reserved. 32 Fabric & Memory Controller Krait CPUAdreno GPU Krait CPU Krait CPU Krait CPU MMU MMUs 2MB L2 Hexagon DSP MMU ?? ?? Diversity in a heterogeneous system, such as different clock speeds, different scheduling policies, and more can mean traditional mutual exclusion is not the right choice
  • 33. CONCURRENT DATA-STRUCTURES  Emerging heterogeneous compute clusters means we need:  To adapt existing concurrent data-structures  Developer new concurrent data-structures  Lock based programming may still be useful but often these algorithms will need to be lock-free  Of course, this is a key application of the HSA memory model  To showcase this we highlight the development of a well known (HLM) obstruction-free deque* © Copyright 2012 HSA Foundation. All Rights Reserved. 33 *Herlihy, M. et al. 2003. Obstruction-free synchronization: double-ended queues as an example. (2003), 522–529.
  • 34. HLM - OBSTRUCTION-FREE DEQUE  Uses a fixed length circular queue  At any given time, reading from left to right, the array will contain:  Zero or more left-null (LN) values  Zero or more dummy-null (DN) values  Zero or more right-null (RN) values  At all times there must be:  At least two different nulls values  At least one LN or DN, and at least one DN or RN  Memory consistency is required to allow multiple producers and multiple consumers, potentially happening in parallel from the left and right ends, to see changes from other work-items (HSA Components) and threads (HSA Agents) © Copyright 2012 HSA Foundation. All Rights Reserved. 34
  • 35. HLM - OBSTRUCTION-FREE DEQUE © Copyright 2012 HSA Foundation. All Rights Reserved. 35 LNLN vLN RNv RNRN left right Key: LN – left null value RN – right null value v – value left – left hint index right – right hint index
  • 36. C REPRESENTATION OF DEQUE struct node { uint64_t type : 2; // null type (LN, RN, DN) uint64_t counter : 8 ; // version counter to avoid ABA uint64_t value : 54 ; // index value stored in queue } struct queue { unsigned int size; // size of bounded buffer node * array; // backing store for deque itself } © Copyright 2012 HSA Foundation. All Rights Reserved. 36
  • 37. HSAIL REPRESENTATION  Allocate a deque in global memory using HSAIL @deque_instance: align 64 global_u32 &size; align 8 global_u64 &array; © Copyright 2012 HSA Foundation. All Rights Reserved. 37
  • 38. ORACLE  Assume a function: function &rcheck_oracle (arg_u32 %k, arg_u64 %left, arg_u64 %right) (arg_u64 %queue);  Which given a deque  returns (%k) the position of the left most of RN  ld_global_acq used to read node from array  Makes one if necessary (i.e. if there are only LN or DN)  atomic_cas_global_ar, required to make new RN  returns (%left) the left node (i.e. the value to the left of the left most RN position)  returns (%right) the right node (i.e. the value at position (%k)) © Copyright 2012 HSA Foundation. All Rights Reserved. 38
  • 39. RIGHT POP function &right_pop(arg_u32err, arg_u64 %result) (arg_u64 %deque) { // load queue address ld_arg_u64 $d0, [%deque]; @loop_forever: // setup and call right oracle to get next RN arg_u32 %k; arg_u64 %current; arg_u64 %next; call &rcheck_oracle(%queue) ; ld_arg_u32 $s0, [%k]; ld_arg_u64 $d1, [%current]; ld_arg_u64 $d2, [%next]; // current.value($d5) shr_u64 $d5, $d1, 62; // current.counter($d6) and_u64 $d6, $d1, 0x3FC0000000000000; shr_u64 $d6, $d6, 54; // current.value($d7) and_u64 $d7, $d1, 0x3FFFFFFFFFFFFF; // next.counter($d8) and_u64 $d8, $d2, 0x3FC0000000000000; shr_u64 $d8, $d8, 54; brn @loop_forever ; } © Copyright 2012 HSA Foundation. All Rights Reserved. 39
  • 40. RIGHT POP – TEST FOR EMPTY // current.type($d5) == LN || current.type($d5) == DN cmp_neq_b1_u64 $c0, $d5, LN; cmp_neq_b1_u64 $c1, $d5, DN; or_b1 $c0, $c0, $c1; cbr $c0, @not_empty ; // current node index (%deque($d0) + (%k($s1) - 1) * 16) add_u32 $s1, $s0, -1; mul_u32 $s1, $s1, 16; add_u32 $d3, $d0, $s0; ld_global_acq_u64 $d4, [$d3]; cmp_neq_b1_u64 $c0, $d4, $d1; cbr $c0, @not_empty; st_arg_u32 EMPTY, [&err]; // deque empty so return EMPTY %ret @not_empty: © Copyright 2012 HSA Foundation. All Rights Reserved. 40
  • 41. RIGHT POP – TRY READ/REMOVE NODE // $d9 = (RN, next.cnt+1, 0) add_u64 $d8, $d8, 1; shl_u64 $d9, RN, 62; and_u64 $d8, $d8, $d9; // cas(deq+k, next, node(RN, next.cnt+1, 0)) atomic_cas_global_ar_u64 $d9, [$s0], $d2, $d9; cmp_neq_u64 $c0, $d9, $d2; cbr $c0, @cas_failed; // $d9 = (RN, current.cnt+1, 0) add_u64 $d6, $d6, 1; shl_u64 $d9, RN, 62; and_u64 $d9, $d6, $d9; // cas(deq+(k-1), curr, node(RN, curr.cnt+1,0) atomic_cas_global_ar_u64 $d9, [$s1], $d1, $d9; cmp_neq_u64 $c0, $d9, $d1; cbr $c0, @cas_failed; st_arg_u32 SUCCESS, [&err]; st_arg_u64 $d7, [&value]; %ret @cas_failed: // loop back around and try again © Copyright 2012 HSA Foundation. All Rights Reserved. 41
  • 42. TAKE AWAYS  HSA provides a powerful and modern memory model  Based on the well know SC for DRF  Defined as Release Consistency  Extended with scopes as defined by HRF  OpenCL 2.0 introduces a new memory model  Also based on SC for DRF  Also defined in terms of Release Consistency  Also Extended with scope as defined in HRF  Has a well defined mapping to HSA  Concurrent algorithm development for emerging heterogeneous computing cluster can benefit from HSA and OpenCL 2.0 memory models © Copyright 2012 HSA Foundation. All Rights Reserved. 42

Notes de l'éditeur

  1. Add causaltly bubble for commiunication between blue and yellow workgroups.
  2. Unlike OpenCL 1.x the execution model and memory model for OpenCL 2.0 are very well defined. Possible to rigorously argue about the behavior of OpenCL 2.0 applications.
  3. Need to confirm these are going to be the names for scopes in HSAIL
  4. We use work-items and threads here because if both work-items executing on HSA components and other concurrently executing threads on HSA Agents are accessing the same data-structure (potentially) concurrently, then they must both satisfy the requirements. Clearly all wait-free algorithms are lock-free
  5. On the following slides we high light some aspects of the algorithm and how HSA’s memory model helps to enforce memory consistency
  6. Double CAS would allow us to store pointers here, rather than the more limited 54bit size value, which would have to be used as an index.LL/SC (as ARM supports) might be an alternative to avoid needing the version counter and use the bottom 2 bits of an aligned address to store the null type information.
  7. Align to a cache boundary
  8. Queue mapped to $d0Current mapped to $d1Next mapped to $d2K mapped to $s0