HSA QUEUEING
HOT CHIPS TUTORIAL - AUGUST 2013
IAN BRATT
PRINCIPAL ENGINEER
ARM
HSA QUEUEING, MOTIVATION
MOTIVATION (TODAY’S PICTURE)
© Copyright 2012 HSA Foundation. All Rights Reserved. 3
[Diagram: today's job-dispatch flow across Application, OS, and GPU lanes, involving the steps: Queue Job, Copy/Map Memory, Transfer buffer to GPU, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory.]
HSA QUEUEING: REQUIREMENTS
REQUIREMENTS
 Four mechanisms are required to enable lower-overhead job dispatch:
 Shared Virtual Memory
 System Coherency
 Signaling
 User mode queueing
© Copyright 2012 HSA Foundation. All Rights Reserved. 5
SHARED VIRTUAL MEMORY
SHARED VIRTUAL MEMORY (TODAY)
 Multiple Virtual memory address spaces
© Copyright 2012 HSA Foundation. All Rights Reserved. 7
[Diagram: CPU0 uses VIRTUAL MEMORY1 and the GPU uses VIRTUAL MEMORY2; VA1->PA1 and VA2->PA1 are separate virtual mappings onto the same physical memory.]
SHARED VIRTUAL MEMORY (HSA)
 Common Virtual Memory for all HSA agents
© Copyright 2012 HSA Foundation. All Rights Reserved. 8
[Diagram: CPU0 and the GPU share a single virtual memory space; both use the same VA->PA mapping onto physical memory.]
SHARED VIRTUAL MEMORY
 Advantages
 No mapping tricks, no copying back-and-forth between different PA
addresses
 Send pointers (not data) back and forth between HSA agents.
 Implications
 Common Page Tables (and common interpretation of architectural
semantics such as shareability, protection, etc).
 Common mechanisms for address translation (and servicing
address translation faults)
 Concept of a process address space (PASID) to allow multiple, per
process virtual address spaces within the system.
© Copyright 2012 HSA Foundation. All Rights Reserved. 9
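As a minimal illustration of "send pointers, not data" (the setup code below is hypothetical, not from the deck):

// Illustrative sketch. With HSA shared virtual memory, an ordinary CPU-side
// allocation is visible to every HSA agent in the process (same PASID) at the
// same virtual address, so the pointer itself can be handed to the GPU.
float* samples = new float[1 << 20];   // plain CPU allocation
produce_samples(samples);              // hypothetical CPU-side setup
// Instead of copying the buffer into a GPU-visible region, place the pointer
// in the kernel-argument block of a dispatch packet; the GPU kernel
// dereferences the very same virtual address.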
GETTING THERE …
© Copyright 2012 HSA Foundation. All Rights Reserved. 10
[Diagram: the same Application / OS / GPU dispatch flow as the motivation slide, revisited after adding shared virtual memory.]
SHARED VIRTUAL MEMORY
 Specifics
 Minimum supported VA width is 48b for 64b systems, and 32b for
32b systems.
 HSA agents may reserve VA ranges for internal use via system
software.
 All HSA agents other than the host unit must use the lowest privilege level
 If present, read/write access flags for page tables must be
maintained by all agents.
 Read/write permissions apply to all HSA agents, equally.
© Copyright 2012 HSA Foundation. All Rights Reserved. 11
CACHE COHERENCY
CACHE COHERENCY DOMAINS (1/3)
 Data accesses to the global memory segment from all HSA agents shall be coherent without the need for explicit cache maintenance.
© Copyright 2012 HSA Foundation. All Rights Reserved. 13
CACHE COHERENCY DOMAINS (2/3)
 Advantages
 Composability
 Reduced SW complexity when communicating between agents
 Lower barrier to entry when porting software
 Implications
 Hardware coherency support between all HSA agents
 Can take many forms
 Stand alone Snoop Filters / Directories
 Combined L3/Filters
 Snoop-based systems (no filter)
 Etc …
© Copyright 2012 HSA Foundation. All Rights Reserved. 14
GETTING CLOSER …
© Copyright 2012 HSA Foundation. All Rights Reserved. 15
[Diagram: the same Application / OS / GPU dispatch flow as the motivation slide, revisited after adding cache coherency.]
CACHE COHERENCY DOMAINS (3/3)
 Specifics
 No requirement for instruction memory accesses to be
coherent
 Only applies to the Primary memory type.
 No requirement for HSA agents to maintain coherency
to any memory location where the HSA agents do not
specify the same memory attributes
 Read-only image data is required to remain static during
the execution of an HSA kernel.
 No double mapping (via different attributes) in order to modify it; the data must remain static
© Copyright 2012 HSA Foundation. All Rights Reserved. 16
SIGNALING
SIGNALING (1/3)
 HSA agents support the ability to use signaling objects
 All creation and destruction of signaling objects occurs via HSA
runtime APIs
 Object creation/destruction
 From an HSA agent you can directly access signaling objects:
 Signaling a signal object (this will wake up HSA
agents waiting upon the object)
 Query the current object
 Wait on the current object (various conditions
supported).
© Copyright 2012 HSA Foundation. All Rights Reserved. 18
SIGNALING (2/3)
 Advantages
 Enables asynchronous interrupts between HSA agents,
without involving the kernel
 Common idiom for work offload
 Low power waiting
 Implications
 Runtime support required
 Commonly implemented on top of cache coherency
flows
© Copyright 2012 HSA Foundation. All Rights Reserved. 19
ALMOST THERE…
© Copyright 2012 HSA Foundation. All Rights Reserved. 20
[Diagram: the same Application / OS / GPU dispatch flow as the motivation slide, revisited after adding signaling.]
SIGNALING (3/3)
 Specifics
 Only supported within a PASID
 Supported wait conditions are =, !=, < and >=
 Wait operations may return sporadically (no guarantee
against false positives)
 Programmer must test.
 Wait operations have a maximum duration before
returning.
 The HSAIL atomic operations are supported on signal
objects.
 Signal objects are opaque
© Copyright 2012 HSA Foundation. All Rights Reserved. 21
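Because a wait may return spuriously and has a maximum duration, the usual pattern is to re-test the signal value in a loop. A hedged sketch (hsa_signal_load and hsa_signal_wait_eq are placeholder names, not the architected API):

// Sketch: block until the signal value equals DONE, tolerating spurious
// returns and timeouts. Helper names are illustrative placeholders.
while (hsa_signal_load(signal) != DONE) {
  // May return early (false positive) or when the maximum wait duration
  // expires; the surrounding loop re-tests the actual value.
  hsa_signal_wait_eq(signal, DONE, MAX_WAIT_DURATION);
}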
USER MODE QUEUEING
USER MODE QUEUEING (1/3)
 User mode Queueing
 Enables user space applications to directly, without OS intervention, enqueue jobs
(“Dispatch Packets”) for HSA agents.
 Dispatch packet is a job of work
 Support for multiple queues per PASID
 Multiple threads/agents within a PASID may enqueue Packets in the same Queue.
 Dependency mechanisms are provided for ensuring ordering between packets.
© Copyright 2012 HSA Foundation. All Rights Reserved. 23
USER MODE QUEUEING (2/3)
 Advantages
 Avoid involving the kernel/driver when dispatching work for an Agent.
 Lower latency job dispatch enables finer granularity of offload
 Standard memory protection mechanisms may be used to protect communication
with the consuming agent.
 Implications
 Packet formats/fields are Architected – standard across vendors!
 Guaranteed backward compatibility
 Packets are enqueued/dequeued via an Architected protocol (all via memory
accesses and signalling)
 More on this later……
© Copyright 2012 HSA Foundation. All Rights Reserved. 24
SUCCESS!
© Copyright 2012 HSA Foundation. All Rights Reserved. 25
[Diagram: the same Application / OS / GPU dispatch flow as the motivation slide, revisited after adding user mode queueing.]
SUCCESS!
© Copyright 2012 HSA Foundation. All Rights Reserved. 26
[Diagram: the simplified flow — the Application queues the job and the GPU starts and finishes it; no OS steps remain on the dispatch path.]
ARCHITECTED QUEUEING
LANGUAGE, QUEUES
ARCHITECTED QUEUEING LANGUAGE
 HSA Queues look just like standard
shared memory queues, supporting
multi-producer, single-consumer
 Support is allowed for single-producer, single-consumer
 Queues consist of storage, read/write
indices, ID, etc.
 Queues are created/destroyed via calls
to the HSA runtime
 “Packets” are placed in queues directly
from user mode, via an architected
protocol
 Packet format is architected
© Copyright 2012 HSA Foundation. All Rights Reserved. 28
[Diagram: two Producers enqueue Packets and a single Consumer dequeues them; a Read Index and a Write Index track positions within storage held in coherent, shared memory.]
ARCHITECTED QUEUEING LANGUAGE
 Once a packet is enqueued, the producer signals the doorbell
 Consumers are not required to wait on the doorbell – the consumer could instead be
polling.
 The doorbell is not the synchronization mechanism (the shared memory updates
ensure the synchronization).
 Packets are read and dispatched for execution from the queue in order, but
may complete in any order.
 There is no guarantee that more than one packet will be processed in parallel at a
time
 There may be many queues. A single agent may also consume from several
queues.
 A packet processing agent may also enqueue packets.
© Copyright 2012 HSA Foundation. All Rights Reserved. 29
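The producer and consumer listings on the next two slides assume per-queue bookkeeping roughly like the sketch below (illustrative only; the architected queue structure and its exact field names are defined by the HSA specification):

#include <atomic>
#include <cstdint>

struct AqlPacket;                        // architected 64-byte packet (layout shown later)
AqlPacket*            BaseAddress;       // packet storage in coherent, shared memory
std::atomic<uint64_t> WriteOffset;       // monotonically increasing write index
std::atomic<uint64_t> ReadOffset;        // monotonically increasing read index
const uint64_t        Size = 256;        // number of packet slots (a power of two; 256 is just an example)
// A doorbell signal, rung by producers, lets the consumer wait at low power
// instead of polling.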
POTENTIAL MULTI-PRODUCER ALGORITHM
// Read the current queue write offset
tmp_WriteOffset = WriteOffset;
// Wait until the queue is no longer full.
while (tmp_WriteOffset == ReadOffset + Size) {}
// Atomically bump the WriteOffset (if the compare-exchange fails, another
// producer won the slot; in practice the sequence is retried from the top).
if (WriteOffset.compare_exchange_strong(tmp_WriteOffset, tmp_WriteOffset + 1,
std::memory_order_acquire)) {
// Calculate the index
uint32_t index = tmp_WriteOffset & (Size - 1);
// Copy over the packet; its format field is INVALID
BaseAddress[index] = pkt;
// Update the format field with release semantics
BaseAddress[index].hdr.format.store(DISPATCH, std::memory_order_release);
// Ring the doorbell, with release semantics (could also amortize over multiple packets)
hsa_ring_doorbell(tmp_WriteOffset + 1);
}
© Copyright 2012 HSA Foundation. All Rights Reserved. 30
POTENTIAL CONSUMER ALGORITHM
// Spin while empty (could also perform a low-power wait on the doorbell).
// The acquire load pairs with the producer's release store of the format field.
while (BaseAddress[ReadOffset & (Size - 1)].hdr.format.load(std::memory_order_acquire) == INVALID) { }
// Calculate the index
uint32_t index = ReadOffset & (Size - 1);
// Copy over the packet
pkt = BaseAddress[index];
// Set the format field back to INVALID
BaseAddress[index].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the read offset
ReadOffset.store(ReadOffset + 1, std::memory_order_release);
© Copyright 2012 HSA Foundation. All Rights Reserved. 31
ARCHITECTED QUEUEING
LANGUAGE, PACKETS
DISPATCH PACKET
© Copyright 2012 HSA Foundation. All Rights Reserved. 33
 Packets come in two main types (Dispatch and Barrier), with architected
layouts
 Dispatch packet is the most common type of packet
 Contains
 Pointer to the kernel
 Pointer to the arguments
 WorkGroupSize (x,y,z)
 gridSize(x,y,z)
 And more……
 Packets contain an additional “barrier” flag. When the barrier flag is set, no
other packets will be launched until all previously launched packets from this
queue have completed.
DISPATCH PACKET
© Copyright 2012 HSA Foundation. All Rights Reserved. 34
Offset  Format    Field Name                    Description
0       uint32_t  format:8                      AQL_FORMAT: 0=INVALID, 1=DISPATCH, 2=DEPEND, others reserved
                  barrier:1                     If set then processing of packet will only begin when all preceding packets are complete.
                  acquireFenceScope:2           Determines the scope and type of the memory fence operation applied before the job is dispatched.
                  releaseFenceScope:2           Determines the scope and type of the memory fence operation applied after kernel completion but before the job is completed.
                  invalidateInstructionCache:1  Acquire fence additionally applies to any instruction cache(s).
                  invalidateROImageCache:1      Acquire fence additionally applies to any read-only image cache(s).
                  dimensions:2                  Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
                  reserved:15
4       uint16_t  workgroupSize.x               x dimension of work-group (measured in work-items).
6       uint16_t  workgroupSize.y               y dimension of work-group (measured in work-items).
8       uint16_t  workgroupSize.z               z dimension of work-group (measured in work-items).
10      uint16_t  reserved2
12      uint32_t  gridSize.x                    x dimension of grid (measured in work-items).
16      uint32_t  gridSize.y                    y dimension of grid (measured in work-items).
20      uint32_t  gridSize.z                    z dimension of grid (measured in work-items).
24      uint32_t  privateSegmentSizeBytes       Total size in bytes of private memory allocation request (per work-item).
28      uint32_t  groupSegmentSizeBytes         Total size in bytes of group memory allocation request (per work-group).
32      uint64_t  kernelObjectAddress           Address of an object in memory that includes an implementation-defined executable ISA image for the kernel.
40      uint64_t  kernargAddress                Address of memory containing kernel arguments.
48      uint64_t  reserved3
56      uint64_t  completionSignal              Address of HSA signaling object used to indicate completion of the job.
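Read as a C-style struct, the 64-byte layout in the table could be sketched as below (field names follow the table; the bit-field packing here is illustrative — the byte/bit offsets in the table are what is architected):

struct AqlDispatchPacket {                   // illustrative; mirrors the table above
  uint32_t format                     : 8;   // 0=INVALID, 1=DISPATCH, 2=DEPEND
  uint32_t barrier                    : 1;   // wait for all preceding packets in this queue
  uint32_t acquireFenceScope          : 2;
  uint32_t releaseFenceScope          : 2;
  uint32_t invalidateInstructionCache : 1;
  uint32_t invalidateROImageCache     : 1;
  uint32_t dimensions                 : 2;   // 1, 2, or 3
  uint32_t reserved                   : 15;
  uint16_t workgroupSize_x, workgroupSize_y, workgroupSize_z;
  uint16_t reserved2;
  uint32_t gridSize_x, gridSize_y, gridSize_z;
  uint32_t privateSegmentSizeBytes;          // per work-item
  uint32_t groupSegmentSizeBytes;            // per work-group
  uint64_t kernelObjectAddress;              // ISA image for the kernel
  uint64_t kernargAddress;                   // kernel arguments
  uint64_t reserved3;
  uint64_t completionSignal;                 // signaled when the job completes
};                                           // 64 bytes total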
BARRIER PACKET
 Used for specifying dependences between packets
 HSA will not launch any further packets from this queue until the barrier packet
signal conditions are met
 Used for specifying dependences on packets dispatched from any queue.
 Execution phase completes only when all of the dependent signals (up to five) have
been signaled (with the value of 0).
 Or if an error has occurred in one of the packets upon which we have a
dependence.
© Copyright 2012 HSA Foundation. All Rights Reserved. 35
BARRIER PACKET
© Copyright 2012 HSA Foundation. All Rights Reserved. 36
Offset  Format    Field Name                    Description
0       uint32_t  format:8                      AQL_FORMAT: 0=INVALID, 1=DISPATCH, 2=DEPEND, others reserved
                  barrier:1                     If set then processing of packet will only begin when all preceding packets are complete.
                  acquireFenceScope:2           Determines the scope and type of the memory fence operation applied before the job is dispatched.
                  releaseFenceScope:2           Determines the scope and type of the memory fence operation applied after kernel completion but before the job is completed.
                  invalidateInstructionCache:1  Acquire fence additionally applies to any instruction cache(s).
                  invalidateROImageCache:1      Acquire fence additionally applies to any read-only image cache(s).
                  dimensions:2                  Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
                  reserved:15
4       uint32_t  reserved2
8       uint64_t  depSignal0                    Address of dependent signaling objects to be evaluated by the packet processor.
16      uint64_t  depSignal1
24      uint64_t  depSignal2
32      uint64_t  depSignal3
40      uint64_t  depSignal4
48      uint64_t  reserved3
56      uint64_t  completionSignal              Address of HSA signaling object used to indicate completion of the job.
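For example, a barrier packet that gates later packets on two completion signals might be filled in like this (a hedged sketch; sig_a, sig_b, and done_sig are hypothetical signal addresses, and the struct mirrors the table above):

AqlBarrierPacket pkt = {};                    // illustrative struct; format starts as INVALID (0)
pkt.barrier          = 1;                     // also wait for all preceding packets in this queue
pkt.depSignal0       = (uint64_t)sig_a;       // addresses of the dependent signal objects
pkt.depSignal1       = (uint64_t)sig_b;       // unused depSignal slots stay 0
pkt.completionSignal = (uint64_t)done_sig;    // signaled when the barrier packet completes
// When enqueued via the producer algorithm shown earlier, the packet body is
// copied first and the format field is then stored as DEPEND with release
// semantics.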
DEPENDENCES
 A user may never assume more than one packet is being executed by an HSA
agent at a time.
 Implications:
 Packets can’t poll on shared memory values which will be set by packets issued
from other queues, unless the user has ensured the proper ordering.
 To ensure all previous packets from a queue have been completed, use the Barrier
bit.
 To ensure specific packets from any queue have completed, use the Barrier packet.
© Copyright 2012 HSA Foundation. All Rights Reserved. 37
HSA QUEUEING, PACKET EXECUTION
PACKET EXECUTION
 Launch phase
 Initiated when launch conditions are met
 All preceding packets in the queue must have exited launch phase
 If the barrier bit in the packet header is set, then all preceding packets in the
queue must have exited completion phase
 Active phase
 Execute the packet
 Barrier packets remain in Active phase until conditions are met.
 Completion phase
 First step is memory release fence – make results visible.
 CompletionSignal field is then signaled with a decrementing atomic.
© Copyright 2012 HSA Foundation. All Rights Reserved. 39
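In the style of the earlier pseudocode, the completion phase could be sketched as a release fence followed by a decrementing atomic on the completion signal (illustrative; the packet processor's real mechanism is implementation defined):

// Completion phase sketch, performed by the packet processor.
// completionSignal is treated as a std::atomic<int64_t>* for this sketch.
std::atomic_thread_fence(std::memory_order_release);        // make the kernel's results visible
completionSignal->fetch_sub(1, std::memory_order_release);  // decrementing atomic; wakes waiters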
PACKET EXECUTION
© Copyright 2012 HSA Foundation. All Rights Reserved. 40
[Timeline: Pkt1 and Pkt2 launch, execute, and complete in overlapping fashion; Pkt3 (barrier=1) enters its launch phase but stays in launch until all preceding packets have completed, and only then executes.]
PUTTING IT ALL TOGETHER (FFT)
© Copyright 2012 HSA Foundation. All Rights Reserved. 41
[Diagram: an 8-point FFT over inputs X[0]..X[7], computed in three stages; Packets 1-2, 3-4, and 5-6 each handle one stage, with a barrier between stages.]
PUTTING IT ALL TOGETHER
© Copyright 2012 HSA Foundation. All Rights Reserved. 42
AQL Pseudo Code
// Send the packets to do the first stage.
aql_dispatch(pkt1);
aql_dispatch(pkt2);
// Send the next two packets, setting the barrier bit so we
// know packets 1 & 2 will be complete before 3 and 4 are
// launched.
aql_dispatch_with_barrier_bit(pkt3);
aql_dispatch(pkt4);
// Same as above (make sure 3 & 4 are done before issuing 5
// & 6)
aql_dispatch_with_barrier_bit(pkt5);
aql_dispatch(pkt6);
// This packet will notify us when 5 & 6 are complete.
aql_dispatch_with_barrier_bit(finish_pkt);
QUESTIONS?
© Copyright 2012 HSA Foundation. All Rights Reserved. 43