1. Platform Coherency and SoC Verification Challenges
PANKAJ SINGH, CHETHAN-RAJ M, PRAKASH RAGHAVENDRA, ANINDYASUNDAR NANDI, DIBYENDU DAS AND TONY TYE
THE 11TH INTERNATIONAL SYSTEM-ON-CHIP (SOC) CONFERENCE, EXHIBIT, AND WORKSHOPS, OCTOBER 2013, IRVINE, CALIFORNIA
WWW.SOCCONFERENCE.COM
ACKNOWLEDGEMENTS:
PHIL ROGERS (AMD CORPORATE FELLOW), ROY JU & BEN SANDER (SENIOR FELLOWS),
NARENDRA KAMAT, PRAVEEN DONGARA AND LEE HOWES
2. TODAY’S TOPICS
A New Parallel Computing Platform
– Heterogeneous System Architecture
Opportunities, Benefits and Feature Roadmap
Kaveri Platform Coherency
Shared memory, Platform atomics
Kaveri Verification Approach
SoC Verification Challenges and Solutions
| 11th Intl. SoC Conference| Oct 23rd,24th, 2013
3. A New Parallel Computing
Platform – Heterogeneous
System Architecture (HSA)
4. APU: ACCELERATED PROCESSING UNIT
The APU is a great advance compared to previous platforms: it combines scalar processing on the CPU with parallel processing on the GPU, together with high-bandwidth access to memory.
[Diagram: CPU core pair alongside GPU SIMD units]
Challenge: How do we make it even better going forward?
Easier to program
Easier to optimize
Easier to load balance
Higher performance
Lower power
5. THE HSA OPPORTUNITY ON MODERN APPLICATIONS
PROBLEM: GPU/HW blocks are hard to program, and not all workloads accelerate. Historically, developers program CPUs: ~20+M CPU coders (*IDC) produce ~4M apps and good user experiences, while only tens of thousands of GPU coders produce a few hundred apps of significant niche value — a large developer investment (effort, time, new skills) for limited return.
SOLUTION: HSA + libraries = productivity and performance with low power. A few million HSA coders could produce a few hundred thousand HSA apps delivering a wide range of differentiated experiences (differentiation in performance, reduced power, features, time to market).
6. HSA AND ITS BENEFITS
HSA IS A COMPUTING PLATFORM THAT DRIVES NEW CLASS OF APPLICATIONS
App-Accelerated
Software Applications
Graphics Workloads
Data-Parallel Workloads
Serial and Task-Parallel Workloads
HSA is an enabler of APU’s higher performance and power efficiency
Our industry-leading APUs speed up applications beyond graphics
CPU and GPU (the APU) work cooperatively, directly in system memory
Makes programming the APU as easy as C++
Improves Performance per watt
Ref [1]
7. HSA EFFICIENCY IMPROVEMENT (AN EXAMPLE)
Improves Power and Performance: Move application from CPU to GPU, remove data copies,
and reduce launch time
[Chart residue: measured power (0–35 W) and measured performance (0–25 fps) for CPU-only vs CPU+GPU runs, broken down into CPU cores, NB+GPU, and DRAM]
ENERGY COMPUTATION BREAKDOWN: MOTIONDSP 720P VIDEO CLEAN-UP
Simulated removal of memory copies: 1.32×
1.11 × 2.88 × 1.32 = 4.22× Better Energy Efficiency (Easier to Program + Remove Copies)
Ref [1]
8. HETEROGENEOUS SYSTEM ARCHITECTURE FEATURE ROADMAP
Physical Integration: Integrate CPU & GPU in silicon; Unified Memory Controller; Common Manufacturing Technology
Optimized Platforms: GPU Compute C++ support; User Mode Scheduling; Bi-Directional Power Mgmt between CPU and GPU
Architectural Integration: Unified Address Space for CPU and GPU; GPU uses pageable system memory via CPU pointers; Fully coherent memory between CPU & GPU
System Integration: GPU compute context switch; GPU graphics pre-emption; Quality of Service
10. KAVERI SOC – ENABLING SHARED MEMORY AND PLATFORM ATOMICS
Shared memory accesses between the CPU and GPU happen via 'system memory'.
– This corresponds to the notion of shared virtual memory (SVM) in OpenCL 2.0, available via the clSVMAlloc() call. With SVM, CPUs and GPUs can share an address space and pass pointers to the same memory location.
– The compiler supports SVM and atomics calls that work across the CPU-GPU boundary.
– System-memory accesses may take one of three paths:
If coherence with the CPU is not required: GARLIC path
If kernel-granularity coherence with the CPU is required: ONION bus path
If instruction-granularity coherence with the CPU is required: bypass L2 via the ONION+ bus (required by atomics)
11. CONCURRENT STACK PUSH USING ATOMIC COMPARE-AND-EXCHANGE (AN EXAMPLE)
Each CPU thread and each GPU work-item executes the following code concurrently. The code is an example implementation of a concurrent stack's "push" operation: atomic_compare_exchange_strong is an atomic call that ensures only one CPU thread or GPU work-item at a time succeeds in updating the stack's "head" pointer stored in list[0].
head = list[0]; // initial read; re-reading inside the loop is redundant because the atomic call refreshes head on failure
do {
    list[i] = head;
} while (!atomic_compare_exchange_strong(&list[0], &head, i));
[Diagram: work-items i=2 and i=4 contend for the atomic compare-and-exchange (ACE) on the list 3 (head) → 5 → -1]
Before ACE: both read head=3 and link themselves in front (list[2]=3, list[4]=3).
ACE: i=2 wins; after its ACE completes, list[0]=2, giving 2 (head) → 3 → 5 → -1.
i=4 loses, re-reads list[0]=2, goes back, and retries.
12. IMPLEMENTING PLATFORM ATOMICS FOR KAVERI
The compiler implements these atomics (per the OpenCL 2.0 standard) for Kaveri.
The key issue in implementing these atomics is making sure that both the CPU and GPU see the shared memory in a coherent state.
Coherency is implemented using the ONION+ memory path and GPU ISA instructions that can selectively invalidate/bypass the L1/L2 caches from the GPU side and snoop to invalidate the CPU caches. This support is provided in the KV SoC.
For example, atomic_load with acquire semantics generates the first GPU ISA sequence shown (in Kaveri, L2 is always bypassed for coherent access); atomic_store with release semantics generates the second.
atomic_load (acquire):
1. load with glc=1    // bypass the L1 cache
2. s_waitcnt 0        // wait for the load to complete
3. buffer_wbinv_vol   // invalidate L1 so that any following load reads from memory

atomic_store (release):
1. s_waitcnt 0        // wait for any previous memop to complete
2. store with glc=0   // L1 is a write-through cache, so the data is written to memory as L2 is bypassed
3. s_waitcnt 0        // prevent any following memop from moving up
OpenCL 2.0 and C11 atomics support various kinds of memory_scope & memory_ordering
14. TRADITIONAL VERIFICATION AND SOC CHALLENGE
[Diagram: CPU + NorthBridge + DRAM model; graphics model + GFX; SouthBridge BFM]
CPU-BASED VERIFICATION
Assembly-based input
A memory image of x86 machine code is preloaded into the DRAM model
The CPU fetches instructions from DRAM and executes them
GPU-BASED VERIFICATION
Higher-level language input (C/C++)
A BFM model is used across a PCIe-based interface to inject data
The GPU sends requests to DRAM over two paths: coherent and non-coherent
SoC Verification Challenge
A layer of complexity is added by the HSA coherency environment.
The SoC GPU needs to be programmed, which requires a host.
The SoC CPU can be used as the host; however, running the full host software stack results in huge simulation time.
One approach is a mailbox: inefficient due to the lack of CPU-GPU interaction and longer run time.
GPU-focused verification is not suitable for CPU-GPU interaction (HSA).
15. SOC VERIFICATION METHODOLOGY: TEST FLOW
Running driver code on a simulated CPU is impractical due to simulation run-times.
Intent Capture is a mechanism that allows existing discrete-GPU graphics tests to execute on the CPU in a heterogeneous APU simulation.
[Flow: GPU test (OpenCL) → Intent Capture → Replay() + sp3 shader output; CPU test runs on the other threads; the composite is compiled with cxshell → .sim memory image → APU RTL sim → test output]
The memory accesses and configuration writes from the test are extracted into C function calls.
Intent Capture performs this activity and encapsulates the GPU test into a function called Replay.
On the CPU side, one thread runs the Replay function while the other threads execute the CPU side of the test.
The composite test (CPU test + generated FusionReplay function) is compiled using cxshell to generate a .sim memory image.
Ref [4]
16. POWER MANAGEMENT: BAPM
[Chart: SW/OS view of P-states (P0/Pbase, P1, SWP0, SWP1, …) vs HW view with multiple boost P-states (Pb0 … Pbx); per-core power, rest-of-APU power, die temperature, and total APU power shown for App1 (low CAC, all cores active), App2 (medium CAC, half cores active), and App3 (high CAC, all cores active) — illustration with a CPU-centric scenario, Ref [2]]
BAPM control loop:
The CPU and GPU power monitors calculate CPU and GPU power per compute unit.
Firmware converts power into temperature estimates.
Firmware compares temperature to the limit and adjusts voltage/frequency: if Temp > Limit, reduce the power allocation; if Temp < Limit, increase the power allocation.
In a multi-core design, apps running on CPU/GPU cores may consume less power than budgeted.
Power-efficient algorithms exploit this power headroom for performance.
The GPU can borrow power credit from the CPU in GPU-centric scenarios and vice versa.
17. BAPM VERIFICATION APPROACH @ SOC
[Diagram: CPU cores and GPU cores, each with a power monitor, reporting to the NB CAC manager and SMU F/W]
Developed high- and low-power-consuming CPU patterns based on micro-architecture and power analysis.
Interleaved high- and low-power patterns in random stimulus.
Used an irritator to manipulate the credits sent to the CAC manager to hit corner cases like back-to-back boost/throttle.
Modeled the F/W algorithm using a simple BFM.
Added a CSR framework to drive reads/writes to the CAC manager.
Ran a few sanity tests with real F/W loaded through the backdoor to check the end-to-end flow.
Used irritators to model GPU power-credit reporting instead of running GPU applications; the GPU power monitor was verified at the GPU IP level.
Efficient coverage-driven random verification: CPU boosted because of the GPU giving away credits and vice versa; crosses of CPU/GPU events and their effect on BAPM.
19. TEST STIMULUS REUSE AND PORTING TO SOC
Tool and flow differences and setup across IP and SoC make stimulus reuse difficult.
Use a functional model to simulate the IP RTL in an SoC scenario for IP test development and easy porting to SoC: a simple HSA SoC test with one read-write in RTL takes about 18 hours, whereas it is <1 hour on the heterogeneous C model.
Update the test setup at IP level to support test runs with SoC as a new target (export suite and test keys via the IP2SoC script). Test-setup updates such as configuration changes and test-stimulus defines allowed IP tests to be reused.
Intent Capture and Playback methodology: DV test → GPU C model → capture output → replay on the APU → compare test outputs.
[Flow residue: cMemory/MEMIO memory models, MPMM, UNB perf options, CPU-to-GPU access, CPU C model/RTL, bus unit; run_job command-line options (directories, GNB/XNB/UNB, NB/DCT programming options, memory config, perf_options.yml); create job spec (ip2soc –merge); run/execute regression; common test options, reports, sim output]
Goal: Improve quality, reduce development time.
20. HW-SW INTERACTION: MODELING AND ABSTRACTION
Complex and evolving logic is moving from hardware to firmware for better controllability. Challenges:
Firmware algorithms are compute-intensive and often developed late in the design cycle.
The load and execution time of the software adds a further challenge to verification.
Connected Standby Verification Approach
Model the relevant section of the software using a BFM with a proper interface to the hardware.
Add sufficient controllability to stress different paths of the BFM model; track coverage.
Adapt stimulus based on coverage of the BFM/state machine.
Goals: Improve quality, reduce development time.
21. ADAPTIVE STIMULUS
Typically, power-management transitions kick off after active code execution stops. This results in deep corner cases associated with thread-level coordination in a multi-core design, and predicting occurrences of these deeper phases and targeting them with code/stimulus is difficult.
Define the power-management modes as state machines, each state having granular phases including thread-specific information.
A dynamic irritator monitors these state transitions, inserts random/directed asynchronous events (different sorts of interrupts, probes, warm reset) and updates a scoreboard.
Events are generated very close to the relevant points, which provides great controllability.
The dynamic irritator adapts based on scoreboard statistics, eventually putting more weight on the less frequently covered <state> × <event> buckets.
Goals: Improve quality, reduce development time.
22. CONSTRAINT RANDOM STIMULUS AND RANDOMIZATION AT SOC
A complex SoC requires randomization at different levels (Ref [3]):
SoC constraints and IP constraints
Register and fuse values
Modes: LFBR, BfD, long_init/unfused test
[Flow: build the randomization utility with package-level info → RandomConfig executable → at time t=0, import config values after reset via command-line options → run, starting from random initial states (S11, S21, S23, …)]
Goals: Improve quality, reduce development time.
23. OVERCOMING LIMITATIONS OF GATE-LEVEL SIMULATION
Challenges with netlist simulation:
Longer run-times
Longer debug times
Approach to minimize runtime: compute-intensive RTL and the associated verification components are replaced with a less intensive test-vector applicator that applies test vectors directly from an FSDB file.
Flow: run the RTL simulation and capture the FSDB → create gatesim files (gatesim.v, forces.v) → build the netlist + gatesim files + a TB that drives stimulus from the FSDB → run netlist sims (with FSDB dump).
This gives a 10x runtime improvement over the traditional approach.
Approach to minimize debug effort: a Verdi NPI-based methodology to automate debug. Ref [5]
Goals: Improve quality, reduce development time.
25. REFERENCES
[1] A New Parallel Computing Platform – HSA. Keynote speech, CTHPC 2013. Roy Ju, AMD Senior Fellow.
[2] AMD APUs: Dynamic Power Management Techniques. DAC 2013. Praveen Dongara, System Architect.
[3] Wilson Research Group / MGC, 2013.
[4] Kaveri DTP. Internal document.
[5] Innovative Approach to Overcome Limitations of Netlist Simulation. SNUG 2013. Prodip K., Pankaj S., Meera M., Narendran K.
26. GLOSSARY
GPU – Graphics Processing Unit
APU – Accelerated Processing Unit
OpenCL™ – Open Computing Language
TDP – Thermal Design Power: the average thermal dissipation power a design's cooling infrastructure must be able to handle
AMD Turbo Core Technology – AMD boost mechanism
BAPM – Bi-directional Application Power Management
CAC – Capacitance AC switching; measures the switching activity of a cluster
Pstate – Processor performance state
GARLIC – Graphics Accelerated Reduced Latency Integrated Channel
ONION – On-chip Northbridge to I/O Non-coherent bus
FSDB – Fast Signal Database
28. DYNAMIC FINE-GRAINED POWER TRANSFERS
The dynamically calculated temperature of each core and the GPU enables the operating point of each to be dynamically balanced in order to maximize performance within temperature limits. Low activity in one core enables it to act as a thermal sink for a more active core.
[Charts: per-core temperature (75–100 °C) under GPU-centric, balanced, and CPU-centric power-transfer scenarios — Ref [2]]