3. HSA Design (2015-04-30) @ NCKU, Tainan
What is HSA?
An intelligent computing architecture that enables CPU, GPU and other
processors to work in harmony on a single piece of silicon by seamlessly
moving the right tasks to the best suited processing element.
Three Eras of Processor Performance
[Figure: three performance-versus-time curves, one per era, each marked "we are here".]
Single-Core Era (single-thread performance vs. time)
  Enabled by: Moore’s Observation, voltage scaling, micro-architecture
  Constrained by: power, complexity
  Programming: Assembly, C/C++, Java …
Multi-Core Era (throughput performance vs. time, # of processors)
  Enabled by: Moore’s Observation, desire for throughput, 20 years of SMP arch
  Constrained by: power, parallel SW availability, scalability
  Programming: pthreads, OpenMP / TBB …
Heterogeneous Systems Era (modern application performance vs. time, data-parallel exploitation)
  Enabled by: Moore’s Observation, abundant data parallelism, power-efficient data-parallel processing (GPUs)
  Constrained by: programming models, communication overheads
  Programming: Shader, CUDA, OpenCL, C++ and Java
SOURCE : HSA INTRODUCTION, HSA FOUNDATION (PHIL ROGERS, AMD)
HSA Foundation
Founded in June 2012
www.hsafoundation.com
Developing a new platform for heterogeneous
systems
Launched the official v1.0 specification set in
March 2015
HSA Foundation Members (April 2015)
Founders
Promoters
Contributors
Academics
Supporters
HSA Platform Model
In an HSA system, a device is called an HSA agent; an HSA agent that can
run kernels is also an HSA kernel agent.
[Figure: HSA platform model. The host CPU (an HSA agent running the OS and the HSA runtime) handles serial and task-parallel workloads. The HSA kernel agent contains several compute units (CUs); each CU is made up of lanes (processing elements) that run SIMD data-parallel workloads. The wavefront size is a power of 2 in the range from 1 to 256 inclusive. Credit: Jay Wang, Taiwan, 2015.03]
HSA Intermediate Language (HSAIL)
The HSA Foundation members are building a heterogeneous compute software ecosystem
built on open, royalty-free industry standards and open-source software: the HSA
runtimes and compilation tools are based on open-source technologies such as LLVM and
GCC. ( https://github.com/HSAFoundation )
[Figure: HSAIL software stack. Parallel programming languages (OpenCL, C++ AMP, Java, OpenMP, DSLs) and libraries sit on top of the HSA runtime and compile to HSAIL, a virtual parallel ISA; CLOC compiles OpenCL kernels to HSAIL. Vendor finalizers (Company A CPU, Company B CPU, Company C GPU, Company D GPU, Company E DSP, ...) translate HSAIL for each target: Company A and B CPUs, Company C and D GPUs, Company E DSP, and other hardware accelerators. Credit: Jay Wang, Taiwan, 2014.10]
HSA Memory Consistency Model
(Relaxed Model)
Second Operation (columns):
  (1) ld_rlx
  (2) st_rlx
  (3) atomic_rlx, atomicNoRet_rlx
  (4) atomic_acq, atomicNoRet_acq, fence_acq
  (5) atomic_rel, atomicNoRet_rel, fence_rel
  (6) atomic_ar, atomicNoRet_ar, fence_ar

First Operation                          (1)  (2)  (3)  (4)  (5)  (6)
ld_rlx or st_rlx                         yes  yes  yes  yes  no   no
atomic_rlx, atomicNoRet_rlx              yes  yes  yes  no   no   no
atomic_acq, atomicNoRet_acq, fence_acq   no   no   no   no   no   no
atomic_rel, atomicNoRet_rel              yes  yes  no   no   no   no
fence_rel                                yes  no   no   no   no   no
atomic_ar, atomicNoRet_ar, fence_ar      no   no   no   no   no   no

Memory orders shown: relaxed, acquire, release, acq_rel
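The table can be encoded as a small lookup structure, which makes it easier to query a single pair of operations. The sketch below is only an illustration: the slide does not caption the table, so the meaning of "yes" (most likely "these two operations may be reordered") is an assumption, and the names `TABLE`, `first_row`, `second_col`, and `table_entry` are hypothetical.

```python
# Column groups for the second operation, in the order used by the table.
SECOND_GROUPS = ["ld_rlx", "st_rlx", "atomic_rlx", "acq", "rel", "ar"]

# One row per first-operation group; True stands for "yes" in the table.
TABLE = {
    "ld_rlx/st_rlx": [True, True, True, True, False, False],
    "atomic_rlx":    [True, True, True, False, False, False],
    "acq":           [False] * 6,
    "atomic_rel":    [True, True, False, False, False, False],
    "fence_rel":     [True, False, False, False, False, False],
    "ar":            [False] * 6,
}


def first_row(op):
    """Map an operation name to its row key (first operation)."""
    if op in ("ld_rlx", "st_rlx"):
        return "ld_rlx/st_rlx"
    if op == "fence_rel":
        return "fence_rel"
    order = op.rsplit("_", 1)[1]
    if order == "rlx":
        return "atomic_rlx"
    if order == "rel":
        return "atomic_rel"
    return order  # "acq" or "ar"


def second_col(op):
    """Map an operation name to its column index (second operation)."""
    if op in ("ld_rlx", "st_rlx"):
        return SECOND_GROUPS.index(op)
    order = op.rsplit("_", 1)[1]
    return SECOND_GROUPS.index("atomic_rlx" if order == "rlx" else order)


def table_entry(first, second):
    """Return the yes/no entry for (first operation, second operation)."""
    return TABLE[first_row(first)][second_col(second)]


print(table_entry("ld_rlx", "fence_acq"))   # row 1, column (4)
print(table_entry("atomic_acq", "ld_rlx"))  # row 3, column (1)
```

Keeping the rows as plain lists mirrors the slide layout, so each entry can be checked against the table by eye.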
System Arch. Requirements
1. Shared Virtual Memory
2. Cache Coherency Domains
3. Flat Addressing
4. Endianness
5. Signaling and Synchronization
6. Atomic Memory Operations
7. HSA System Timestamp
8. User Mode Queuing
9. Architected Queuing Language (AQL)
10. Agent Scheduling
11. Kernel Agent Context Switching
12. IEEE754-2008 Floating Point Exceptions
13. Kernel Agent Hardware Debug Infrastructure
14. HSA Platform Topology Discovery
15. Images
@ HSA PLATFORM SYSTEM ARCHITECTURE SPECIFICATION, VERSION 1.0 FINAL (2015-03-16)
Shared Virtual Memory (HSA)
[Figure: The host CPUs (HSA agent) and the GPU (HSA kernel agent) share one virtual address space covering system memory and GPU memory. Labels in the figure: MMU, IOMMU, OS page table. A 32-bit HSA system uses 32-bit virtual addresses; a 64-bit HSA system uses virtual addresses of at least 48 bits. Credit: Jay Wang, Taiwan, 2015.04]
HSA Memory Hierarchy
Memory segments:
1) Global
2) Group
3) Private
4) Kernarg
5) Readonly
6) Spill
7) Arg
[Figure: Segments within the flat address space of an HSA agent. The global segment is visible across the kernel dispatch grid, each work-group has its own group segment, and each work-item (WI) has its own private segment. The figure is labelled "Virtual Address Range Reservation (System Memory or Device Local Memory)". Local registers per work-item: 1-bit control registers (c registers, $c0 to $c7), 32-bit registers (s registers, $s0 to $s127), 64-bit registers (d registers, $d0 to $d63), and 128-bit registers (q registers, $q0 to $q31). Credit: Jay Wang, Taiwan, 2014.10]
Cache Coherency Domains
[Figure: The kernel dispatch grid with its global segment, per-work-group group segments, and per-work-item (WI) private segments within the flat address space. The host CPUs and the HSA kernel agent each access system memory through their own caches, and coherency is maintained between those caches. Credit: Jay Wang, Taiwan, 2015.04]
Signaling and Synchronization
The required mechanisms for HSAIL and the HSA runtime are:
Allocate/Destroy an HSA signal
Read the current HSA signal value
Wait on an HSA signal to meet a specified condition (with a maximum wait duration
requested)
Send an HSA signal value
Atomic read-modify-write an HSA signal value
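The five mechanisms listed above can be modelled with an ordinary condition variable. The sketch below is a toy model in Python, not the HSA runtime API: the class `Signal` and its methods `load`, `store`, `add`, and `wait_until` are hypothetical names chosen for illustration, and allocation and destruction are simply left to Python object lifetime.

```python
import threading


class Signal:
    """Toy model of an HSA signal: a shared integer that threads can wait on."""

    def __init__(self, initial_value):
        self._value = initial_value
        self._cond = threading.Condition()

    def load(self):
        # Read the current signal value.
        with self._cond:
            return self._value

    def store(self, value):
        # Send a new signal value and wake any waiters.
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def add(self, delta):
        # Atomic read-modify-write; returns the new value.
        with self._cond:
            self._value += delta
            self._cond.notify_all()
            return self._value

    def wait_until(self, predicate, timeout=None):
        # Block until predicate(value) holds or the timeout (seconds) expires,
        # then return the value observed at that point.
        with self._cond:
            self._cond.wait_for(lambda: predicate(self._value), timeout)
            return self._value


# A worker signals completion by decrementing the value from 1 to 0.
done = Signal(1)
worker = threading.Thread(target=lambda: done.add(-1))
worker.start()
final = done.wait_until(lambda v: v == 0, timeout=5.0)
worker.join()
print(final)
```

The maximum wait duration maps onto the `timeout` argument: when it expires, the waiter gets back whatever value the signal holds and has to check the condition itself.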
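[Figure: An HSA signal consists of a signal handle (hsa_signal_t) that refers to implementation-defined data holding the signal value (hsa_signal_value_t). The host CPU reaches signals through the HSA runtime APIs, and the HSA kernel agent through HSAIL instructions (Sig32 or Sig64). For comparison, the figure lists the POSIX semaphore calls sem_init(), sem_wait(), sem_post(), sem_destroy() and the mutex calls pthread_mutex_init(), pthread_mutex_lock(), pthread_mutex_unlock(), pthread_mutex_destroy(). Credit: Jay Wang, Taiwan, 2015.04]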
HSAIL Instructions for Signaling
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model,
Compiler Writer’s Guide, and Object Format (BRIG) (v1.0)
6.8 Notification (signal) Instructions
Atomic Memory Operations
HSA requires the following standard atomic memory operations to be
supported by HSA Kernel Agents (other HSA Agents only need to
support the subset of these operations required by their role in the
system):
Load from memory
Store to memory
Fetch from memory, apply a logic operation (bitwise AND/OR/XOR) with one additional operand, and store back.
Fetch from memory, apply an integer arithmetic operation (add, subtract, increment, decrement, minimum, maximum) with one additional operand, and store back.
Exchange memory location with operand.
Compare-and-swap (CAS): load memory location, compare with first operand, and if equal, store second operand back to memory location.
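The semantics of these operations can be illustrated with a lock-protected cell. This is a minimal sketch of the behaviour only, assuming a single integer location; real HSA agents implement the operations in hardware, and the class `AtomicCell` and the helper `cas_increment` are hypothetical names.

```python
import operator
import threading


class AtomicCell:
    """Toy memory location whose operations are made atomic with a lock."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def store(self, value):
        with self._lock:
            self._value = value

    def fetch_op(self, op, operand):
        # Fetch, apply op(old, operand), store back; returns the old value.
        with self._lock:
            old = self._value
            self._value = op(old, operand)
            return old

    def exchange(self, value):
        # Swap in a new value and return the old one.
        with self._lock:
            old, self._value = self._value, value
            return old

    def compare_and_swap(self, expected, desired):
        # Store desired only if the current value equals expected;
        # always returns the value that was loaded.
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old


def cas_increment(cell):
    # Classic CAS retry loop: retry until no other thread interfered.
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1) == old:
            return


counter = AtomicCell(0)
threads = [
    threading.Thread(target=lambda: [cas_increment(counter) for _ in range(1000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())  # 4 threads x 1000 increments

bits = AtomicCell(0b1100)
old_bits = bits.fetch_op(operator.and_, 0b1010)  # bitwise AND with one operand
```

The CAS loop shows why compare-and-swap is enough to build the other read-modify-write operations: a failed comparison simply means another agent updated the location first, so the caller retries with the fresh value.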
HSA System Timestamp
The HSA system provides a low-overhead mechanism for determining the
passing of time.
A system timestamp is required that can be read from HSAIL or through the
HSA runtime.
The system timestamp frequency can also be queried through the HSA runtime.
[Figure: The 64-bit timestamp is read by the HSA kernel agent with the HSAIL clock instruction and by the host CPU through the HSA runtime APIs; the HSA runtime also reports the timestamp frequency (1~400 MHz). Credit: Jay Wang, Taiwan, 2015.04]
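Once two timestamps and the timestamp frequency are known, elapsed time follows from a single division. The sketch below assumes the values have already been obtained (from HSAIL or the HSA runtime); the function name `ticks_to_seconds` is hypothetical, and the wraparound handling is an added assumption based on the counter being 64 bits wide.

```python
def ticks_to_seconds(start_ticks, end_ticks, frequency_hz):
    """Convert a pair of 64-bit timestamp readings to elapsed seconds."""
    # The slide gives a frequency range of 1 to 400 MHz.
    if not 1_000_000 <= frequency_hz <= 400_000_000:
        raise ValueError("timestamp frequency outside the 1-400 MHz range")
    # Modulo 2**64 keeps the result correct if the counter wrapped once.
    delta = (end_ticks - start_ticks) % (1 << 64)
    return delta / frequency_hz


# 100,000 ticks at 100 MHz is one millisecond.
elapsed = ticks_to_seconds(1_000, 101_000, 100_000_000)
print(elapsed)
```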
User Mode Queuing
Multiple user-level command queues
Runtime-allocated
Architected Queuing Language (AQL)
[Figure: A user application reaches the HSA runtime directly or through a language runtime (e.g. the OpenCL runtime); the HSA finalizers and the HSA kernel mode driver sit alongside. The HSA application on the CPU (an HSA agent) and the GPU (an HSA kernel agent) exchange AQL packets through user mode queues: kernel dispatch queues (K) and agent dispatch queues (A). Credit: Jay Wang, Taiwan, 2015.04]
HSA Packet Processor
User Mode Queue Structure (hsa_queue_t)
Offset      Field
0x00        type
0x04        features
0x08, 0x0C  base_address
0x10, 0x14  doorbell_signal
0x18        size
0x1C        reserved (must be 0)
0x20, 0x24  id
Ring Buffer of AQL Packets (64 bytes each)
read_index (64-bit): packet at base_address + ( (read_index % size) * AQL packet size )
write_index (64-bit): packet at base_address + ( (write_index % size) * AQL packet size )
Supports single or multiple producers
Supports KERNEL_DISPATCH and/or AGENT_DISPATCH packets
Jay Wang, Taiwan, 2015.03
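The ring-buffer addressing can be worked through with a few lines of arithmetic. This is a minimal sketch using the formula from the slide; the function names are hypothetical, and the fullness check is an added assumption that the two indices only ever increase.

```python
AQL_PACKET_SIZE = 64  # bytes, as stated for AQL packets


def packet_address(base_address, index, size):
    """Address of the slot that a read or write index refers to."""
    return base_address + (index % size) * AQL_PACKET_SIZE


def queue_is_full(read_index, write_index, size):
    # Assumption: indices grow monotonically, so their difference is the
    # number of packets currently in the ring.
    return write_index - read_index >= size


base = 0x1000
print(hex(packet_address(base, 0, 8)))  # first slot
print(hex(packet_address(base, 9, 8)))  # index 9 wraps to slot 1
```

Because the index is reduced modulo `size` only when the address is computed, the indices themselves never need to be reset.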
Agent Dispatch Packet
Offset      Field
0x00        header (bits 0-15), type (bits 16-31)
0x04        reserved
0x08, 0x0C  return_address
0x10, 0x14  arg0
0x18, 0x1C  arg1
0x20, 0x24  arg2
0x28, 0x2C  arg3
0x30, 0x34  reserved
0x38, 0x3C  completion_signal
type: The function to be performed by the destination agent. The function codes are application defined.
return_address: Pointer to the location in which to store the function return value(s).
arg0 to arg3: 64-bit direct or indirect arguments.
Jay Wang, Taiwan, 2015.03
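The field offsets above add up to exactly 64 bytes, which can be checked by packing the layout with Python's `struct` module. The sketch assumes little-endian byte order and uses made-up field values; `pack_agent_dispatch` is a hypothetical helper, not part of any HSA library.

```python
import struct

# header (16 bits), type (16 bits), reserved (32 bits), return_address,
# arg0..arg3, reserved, completion_signal (64 bits each).
AGENT_DISPATCH_LAYOUT = struct.Struct("<HHIQ4QQQ")


def pack_agent_dispatch(header, type_, return_address, args, completion_signal):
    """Build the 64-byte image of an agent dispatch packet."""
    if len(args) != 4:
        raise ValueError("an agent dispatch packet carries exactly four arguments")
    return AGENT_DISPATCH_LAYOUT.pack(
        header, type_, 0, return_address, *args, 0, completion_signal
    )


# All values below are arbitrary illustration data.
packet = pack_agent_dispatch(
    header=0x0001,
    type_=0x8000,
    return_address=0x7F00_0000_1000,
    args=[1, 2, 3, 4],
    completion_signal=0xABCD,
)
print(len(packet))
```

Reading a field back with `struct.unpack_from("<Q", packet, 0x38)` returns the completion signal handle, confirming that the packed offsets match the table.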
Barrier-AND / Barrier-OR Packet
The Barrier packet defines dependencies for the HSA Packet Processor
to monitor.
The HSA Packet Processor will not launch any further packets until the
Barrier-AND / Barrier-OR packet is complete.
Offset      Field
0x00        header (bits 0-15), reserved (bits 16-31)
0x04        reserved
0x08, 0x0C  dep_signal0
0x10, 0x14  dep_signal1
0x18, 0x1C  dep_signal2
0x20, 0x24  dep_signal3
0x28, 0x2C  dep_signal4
0x30, 0x34  reserved
0x38, 0x3C  completion_signal
dep_signal0 to dep_signal4: Handles for dependent signaling objects to be evaluated by the packet processor.
Jay Wang, Taiwan, 2015.03
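How the packet processor evaluates the dependent signals can be sketched as two predicates. The slide does not spell out the condition; the sketch assumes the rule from the HSA specification that a dependency is satisfied when its signal value is 0. The function names are hypothetical, and the inputs are the current values of up to five dependent signals.

```python
MAX_DEP_SIGNALS = 5  # dep_signal0 to dep_signal4


def _check(dep_values):
    if len(dep_values) > MAX_DEP_SIGNALS:
        raise ValueError("a barrier packet holds at most five dependent signals")


def barrier_and_satisfied(dep_values):
    """Barrier-AND: every dependent signal must have reached 0."""
    _check(dep_values)
    return all(value == 0 for value in dep_values)


def barrier_or_satisfied(dep_values):
    """Barrier-OR: at least one dependent signal must have reached 0."""
    _check(dep_values)
    return any(value == 0 for value in dep_values)


print(barrier_and_satisfied([0, 0, 1]))  # one dependency still pending
print(barrier_or_satisfied([0, 0, 1]))   # one satisfied dependency is enough
```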
Packet Process Flow
Launch Phase
All preceding packets in the queue must have completed their launch phase.
If the barrier bit in the packet header is set, then all preceding packets in the
queue must have completed.
An acquire memory fence is applied for Kernel/Agent Dispatch packets
before the packet enters the active phase.
Active Phase
Kernel Dispatch packets and Agent Dispatch packets execute on the Kernel
Agent/Agent, and the active phase ends when the task completes.
Barrier-AND and Barrier-OR packets remain in the active phase until their
condition is met.
Completion Phase
If the packet is a Barrier-AND or Barrier-OR packet, then an acquire memory
fence is applied as the first step.
After execution of the acquire fence, the memory release fence is applied.
After the memory release fence completes, the signal specified by the
completion_signal field in the AQL packet is signaled with a decrementing
atomic operation.
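The launch rules can be made concrete with a small timeline calculation. The sketch below is a simplified model, assuming unlimited execution resources, a zero-cost launch phase, and known task durations; the function `launch_times` and the `(duration, barrier_bit)` tuple layout are hypothetical.

```python
def launch_times(packets):
    """Compute (start, end) times for packets given as (duration, barrier_bit).

    A packet may start once every earlier packet has launched. If its barrier
    bit is set, it must also wait until every earlier packet has completed.
    """
    times = []
    previous_start = 0  # time at which the preceding packet launched
    latest_end = 0      # time at which all preceding packets are complete
    for duration, barrier_bit in packets:
        start = previous_start
        if barrier_bit:
            start = max(start, latest_end)
        end = start + duration
        times.append((start, end))
        previous_start = start
        latest_end = max(latest_end, end)
    return times


# Packets 0 and 1 overlap; packet 2 has the barrier bit set and waits for both.
timeline = launch_times([(5, False), (2, False), (3, True), (1, False)])
print(timeline)
```

In this model packet 3 starts together with packet 2, because without the barrier bit it only needs the earlier packets to have launched, not to have finished.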
Barrier-bit Example
If the barrier bit is set, then processing of the packet will only begin
when all preceding packets are complete.
[Figure: AQL packets move from enqueue to dequeue and then through the launch phase, active phase, and completion phase, ending with the completionSignal; one AQL packet has barrier bit = 1. Credit: Jay Wang, Taiwan, 2015.04]
Kernel Agent Context Switching
1. Switch (required)
2. Preempt (required as soon as possible)
3. Terminate and context reset (terminated as fast as possible)
[Figure: HSA agent scheduling selects among several AQL queues (#1, #2, #3) and a non-HSA task pool. Context switching on the HSA kernel agent swaps the kernel programs and work-groups (WG) that run on the compute units (CUs). Credit: Jay Wang, Taiwan, 2015.04]
FP Exception Reporting
A Kernel Agent shall report certain defined exceptions related to the
execution of the HSAIL code to the HSA Runtime.
[Figure: In the HSA kernel agent, work-groups (Work-Group 0 to Work-Group X) are divided into wavefronts (Wavefront 0 to Wavefront Y). A compute unit (CU) executes a wavefront in SIMD (Single Instruction, Multiple Data) style with one PC, one work-item per lane (Lane 0 to Lane N-1). An exception module holds the BreakEn bits, DetectEn bits, and status bits, and signals the exception handler in the HSA runtime on the host CPU. Credit: Jay Wang, Taiwan, 2015.04]
IEEE754-2008 exceptions:
Exception Code  Description
0               Invalid operation
1               Divide-by-zero
2               Overflow
3               Underflow
4               Inexact
Control directives:
enablebreakexceptions #EC (BREAK policy)
enabledetectexceptions #EC (DETECT policy)
HSAIL instructions:
cleardetectexcept_u32
getdetectexcept_u32
setdetectexcept_u32
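The DETECT policy can be pictured as a set of status bits, one per exception code. The sketch below is a toy model and rests on assumptions: it treats the exception codes 0 to 4 as bit positions in a 32-bit mask, and the class `DetectStatus` with its `record`, `get`, `set`, and `clear` methods only loosely mirrors the three HSAIL instructions named above.

```python
INVALID_OPERATION = 1 << 0
DIVIDE_BY_ZERO = 1 << 1
OVERFLOW = 1 << 2
UNDERFLOW = 1 << 3
INEXACT = 1 << 4
ALL_EXCEPTIONS = 0b11111


class DetectStatus:
    """Toy model of the DetectEn bits and status bits for one work-group."""

    def __init__(self, detect_enable_mask):
        self.detect_enable = detect_enable_mask & ALL_EXCEPTIONS
        self.status = 0

    def record(self, exception_bits):
        # Hardware side: only exceptions enabled for detection are latched.
        self.status |= exception_bits & self.detect_enable

    def get(self):
        # Counterpart of reading the status bits.
        return self.status

    def set(self, exception_bits):
        # Counterpart of setting status bits from software.
        self.status |= exception_bits & ALL_EXCEPTIONS

    def clear(self, exception_bits):
        # Counterpart of clearing status bits from software.
        self.status &= ~exception_bits & ALL_EXCEPTIONS


state = DetectStatus(DIVIDE_BY_ZERO | OVERFLOW)
state.record(DIVIDE_BY_ZERO)  # enabled, so it is latched
state.record(INEXACT)         # not enabled, so it is ignored
print(bin(state.get()))
```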
Debug Infrastructure
The Kernel Agent shall provide mechanisms to allow system software
and some select application software (for example, debuggers and
profilers) to set breakpoints and collect throughput information for
profiling.
[Figure: Debuggers and profilers on the host CPU (HSA agent) connect through the HSA kernel agent debug interface to a debug module in the HSA kernel agent, which supports instruction breakpoints, memory breakpoints, and conditional breakpoints. The grid is divided into work-groups and wavefronts; a compute unit executes a wavefront in SIMD (Single Instruction, Multiple Data) style with one PC, one work-item per lane (Lane 0 to Lane N-1). Credit: Jay Wang, Taiwan, 2015.04]
Images
A graphics feature that can sometimes be useful in data-parallel computing
Used to store one-, two-, or three-dimensional images
Predefined image formats
Image memory is a special kind of memory access
Dedicated hardware to speed up image operations
The OpenCL™ Specification
Version 2.1:
5.3 Image Objects
https://www.khronos.org/registry/cl/specs/opencl-2.1.pdf
[Figure: An image object managed by the HSA runtime has an image handle (hsa_ext_image_handle_t) that refers to an image descriptor (image geometry, image channel order, image channel type, image data size) and to the image data (1D, 2D, or 3D images) in the global segment. The HSA kernel agent accesses images with the rdimage, ldimage, and stimage instructions. Credit: Jay Wang, Taiwan, 2015.04]
Summary
Programming model issues
  HSA Intermediate Language (HSAIL) + HSA Runtime
  Architected Queuing Language (AQL) + Signaling
  Debug infrastructure
Communication overhead issues
  Cache coherent shared virtual memory (CC-SVM)
  Architected Queuing Language (AQL) for user mode queuing
  Hardware-assisted signaling and atomic operations for synchronization
[Figure: CPUs, GPU, DSP, ... share unified coherent memory and are programmed through HSAIL, the HSA runtime, and AQL. Credit: Jay Wang, Taiwan, 2015.04]