1. Platform Coherency and SoC Verification Challenges
PANKAJ SINGH, CHETHAN-RAJ M, PRAKASH RAGHAVENDRA, ANINDYASUNDAR NANDI, DIBYENDU DAS AND TONY TYE
THE 11TH INTERNATIONAL SYSTEM-ON-CHIP (SOC) CONFERENCE, EXHIBIT, AND WORKSHOPS, OCTOBER 2013, IRVINE, CALIFORNIA
WWW.SOCCONFERENCE.COM
ACKNOWLEDGEMENTS:
PHIL ROGERS (AMD CORPORATE FELLOW), ROY JU & BEN SANDER (SENIOR FELLOWS),
NARENDRA KAMAT, PRAVEEN DONGARA AND LEE HOWES
2. TODAY’S TOPICS
A New Parallel Computing Platform
– Heterogeneous System Architecture
Opportunities, Benefits and Feature Roadmap
Kaveri Platform Coherency
Shared memory, Platform atomics
Kaveri Verification Approach
SoC Verification Challenges and Solutions
| 11th Intl. SoC Conference| Oct 23rd,24th, 2013
3. A New Parallel Computing
Platform – Heterogeneous
System Architecture (HSA)
4. APU: ACCELERATED PROCESSING UNIT
The APU is a great advance compared to previous platforms: it combines scalar processing on the CPU with parallel processing on the GPU, together with high-bandwidth access to memory.
[Diagram: CPU core pair alongside GPU SIMD units]
Challenge: How do we make it even better going forward?
Easier to program
Easier to optimize
Easier to load balance
Higher performance
Lower power
5. THE HSA OPPORTUNITY ON MODERN APPLICATIONS
PROBLEM: GPU/HW blocks are hard to program, and not all workloads accelerate. Historically, developers program CPUs: ~20+M CPU coders (*IDC) produce ~4M apps and good user experiences, while only tens of thousands of GPU coders produce a few hundred apps of significant niche value — a large developer investment (effort, time, new skills) for limited return.
SOLUTION: HSA + libraries = productivity and performance with low power. A few million HSA coders could produce a few hundred thousand HSA apps delivering a wide range of differentiated experiences (differentiation in performance, reduced power, features, time to market).
6. HSA AND ITS BENEFITS
HSA IS A COMPUTING PLATFORM THAT DRIVES NEW CLASS OF APPLICATIONS
App-Accelerated
Software Applications
Graphics Workloads
Data-Parallel Workloads
Serial and Task-Parallel Workloads
HSA is an enabler of APU’s higher performance and power efficiency
Our industry-leading APUs speed up applications beyond graphics
CPU and GPU (the APU) work cooperatively, directly in system memory
Makes programming the APU as easy as C++
Improves Performance per watt
Ref [1]
7. HSA EFFICIENCY IMPROVEMENT (AN EXAMPLE)
Improves Power and Performance: Move application from CPU to GPU, remove data copies,
and reduce launch time
[Chart residue: measured power (0–35 W) and measured performance (0–25 fps) for CPU-only vs CPU+GPU runs, broken down into CPU cores, NB+GPU, and DRAM]
ENERGY COMPUTATION BREAKDOWN: MOTIONDSP 720P VIDEO CLEAN-UP
Simulated removal of memory copies: 1.32×
1.11 × 2.88 × 1.32 = 4.22× Better Energy Efficiency (Easier to Program + Remove Copies)
Ref [1]
8. HETEROGENEOUS SYSTEM ARCHITECTURE FEATURE ROADMAP
Physical Integration: Integrate CPU & GPU in silicon; Unified Memory Controller; Common Manufacturing Technology
Optimized Platforms: GPU Compute C++ support; User Mode Scheduling; Bi-Directional Power Mgmt between CPU and GPU
Architectural Integration: Unified Address Space for CPU and GPU; GPU uses pageable system memory via CPU pointers; Fully coherent memory between CPU & GPU
System Integration: GPU compute context switch; GPU graphics pre-emption; Quality of Service
10. KAVERI SOC – ENABLING SHARED MEMORY AND PLATFORM ATOMICS
Shared memory accesses between the CPU and GPU happen via 'system memory'.
– This corresponds to the notion of shared virtual memory (SVM) in OpenCL 2.0, available via the clSVMAlloc() call. With SVM, CPUs and GPUs can share an address space and pass pointers to the same memory location.
– The compiler supports SVM and atomics calls that work across the CPU-GPU boundary.
– System-memory accesses may take one of three paths:
If coherence with the CPU is not required: GARLIC path
If kernel-granularity coherence with the CPU is required: ONION bus path
If instruction-granularity coherence with the CPU is required: bypass L2 via the ONION+ bus (required by atomics)
11. CONCURRENT STACK PUSH USING ATOMIC COMPARE-AND-EXCHANGE (AN EXAMPLE)
Each CPU thread and each GPU work-item executes the following code concurrently. The code is an example implementation of a concurrent stack's "push" operation: atomic_compare_exchange_strong is an atomic call that ensures only one CPU thread or GPU work-item at a time succeeds in updating the stack's "head" pointer stored in list[0].
head = list[0]; // initial read; re-reading inside the loop is redundant because the atomic call refreshes head on failure
do {
    list[i] = head;
} while (!atomic_compare_exchange_strong(&list[0], &head, i));
[Diagram: work-items i=2 and i=4 contend for the atomic compare-and-exchange (ACE) on the list 3 (head) → 5 → -1]
Before ACE: both read head=3 and link themselves in front (list[2]=3, list[4]=3).
ACE: i=2 wins; after its ACE completes, list[0]=2, giving 2 (head) → 3 → 5 → -1.
i=4 loses, re-reads list[0]=2, goes back, and retries.
12. IMPLEMENTING PLATFORM ATOMICS FOR KAVERI
The compiler implements these atomics (per the OpenCL 2.0 standard) for Kaveri.
The key issue in implementing these atomics is making sure that both the CPU and GPU see the shared memory in a coherent state.
Coherency is implemented using the ONION+ memory path and GPU ISA instructions that can selectively invalidate/bypass the L1/L2 caches from the GPU side and snoop to invalidate the CPU caches. This support is provided in the KV SoC.
For example, atomic_load with acquire semantics generates the first GPU ISA sequence shown (in Kaveri, L2 is always bypassed for coherent access); atomic_store with release semantics generates the second.
atomic_load (acquire):
1. load with glc=1    // bypass the L1 cache
2. s_waitcnt 0        // wait for the load to complete
3. buffer_wbinv_vol   // invalidate L1 so that any following load reads from memory

atomic_store (release):
1. s_waitcnt 0        // wait for any previous memop to complete
2. store with glc=0   // L1 is a write-through cache, so the data is written to memory as L2 is bypassed
3. s_waitcnt 0        // prevent any following memop from moving up
OpenCL 2.0 and C11 atomics support various kinds of memory_scope & memory_ordering
14. TRADITIONAL VERIFICATION AND SOC CHALLENGE
[Diagram: CPU + NorthBridge + DRAM model; graphics model + GFX; SouthBridge BFM]
CPU-BASED VERIFICATION
Assembly-based input
A memory image of x86 machine code is preloaded into the DRAM model
The CPU fetches instructions from DRAM and executes them
GPU-BASED VERIFICATION
Higher-level language input (C/C++)
A BFM model is used across a PCIe-based interface to inject data
The GPU sends requests to DRAM over two paths: coherent and non-coherent
SoC Verification Challenge
A layer of complexity is added by the HSA coherency environment.
The SoC GPU needs to be programmed, which requires a host.
The SoC CPU can be used as the host; however, running the full host software stack results in huge simulation time.
One approach is a mailbox: inefficient due to the lack of CPU-GPU interaction and longer run time.
GPU-focused verification is not suitable for CPU-GPU interaction (HSA).
15. SOC VERIFICATION METHODOLOGY: TEST FLOW
Running driver code on a simulated CPU is impractical due to simulation run-times.
Intent Capture is a mechanism that allows existing discrete-GPU graphics tests to execute on the CPU in a heterogeneous APU simulation.
[Flow: GPU test (OpenCL) → Intent Capture → Replay() + sp3 shader output; CPU test runs on the other threads; the composite is compiled with cxshell → .sim memory image → APU RTL sim → test output]
The memory accesses and configuration writes from the test are extracted into C function calls.
Intent Capture performs this activity and encapsulates the GPU test into a function called Replay.
On the CPU side, one thread runs the Replay function while the other threads execute the CPU side of the test.
The composite test (CPU test + generated FusionReplay function) is compiled using cxshell to generate a .sim memory image.
Ref [4]
16. POWER MANAGEMENT: BAPM
[Chart: SW/OS view of P-states (P0/Pbase, P1, SWP0, SWP1, …) vs HW view with multiple boost P-states (Pb0 … Pbx); per-core power, rest-of-APU power, die temperature, and total APU power shown for App1 (low CAC, all cores active), App2 (medium CAC, half cores active), and App3 (high CAC, all cores active) — illustration with a CPU-centric scenario, Ref [2]]
BAPM control loop:
The CPU and GPU power monitors calculate CPU and GPU power per compute unit.
Firmware converts power into temperature estimates.
Firmware compares temperature to the limit and adjusts voltage/frequency: if Temp > Limit, reduce the power allocation; if Temp < Limit, increase the power allocation.
In a multi-core design, apps running on CPU/GPU cores may consume less power than budgeted.
Power-efficient algorithms exploit this power headroom for performance.
The GPU can borrow power credit from the CPU in GPU-centric scenarios and vice versa.
17. BAPM VERIFICATION APPROACH @ SOC
[Diagram: CPU cores and GPU cores, each with a power monitor, reporting to the NB CAC manager and SMU F/W]
Developed high- and low-power-consuming CPU patterns based on micro-architecture and power analysis.
Interleaved high- and low-power patterns in random stimulus.
Used an irritator to manipulate the credits sent to the CAC manager to hit corner cases like back-to-back boost/throttle.
Modeled the F/W algorithm using a simple BFM.
Added a CSR framework to drive reads/writes to the CAC manager.
Ran a few sanity tests with real F/W loaded through the backdoor to check the end-to-end flow.
Used irritators to model GPU power-credit reporting instead of running GPU applications; the GPU power monitor was verified at the GPU IP level.
Efficient coverage-driven random verification: CPU boosted because of the GPU giving away credits and vice versa; crosses of CPU/GPU events and their effect on BAPM.
19. TEST STIMULUS REUSE AND PORTING TO SOC
Tool and flow differences and setup across IP and SoC make stimulus reuse difficult.
Use a functional model to simulate the IP RTL in an SoC scenario for IP test development and easy porting to SoC: a simple HSA SoC test with one read-write in RTL takes about 18 hours, whereas it is <1 hour on the heterogeneous C model.
Update the test setup at IP level to support test runs with SoC as a new target (export suite and test keys via the IP2SoC script). Test-setup updates such as configuration changes and test-stimulus defines allowed IP tests to be reused.
Intent Capture and Playback methodology: DV test → GPU C model → capture output → replay on the APU → compare test outputs.
[Flow residue: cMemory/MEMIO memory models, MPMM, UNB perf options, CPU-to-GPU access, CPU C model/RTL, bus unit; run_job command-line options (directories, GNB/XNB/UNB, NB/DCT programming options, memory config, perf_options.yml); create job spec (ip2soc –merge); run/execute regression; common test options, reports, sim output]
Goal: Improve quality, reduce development time.
20. HW-SW INTERACTION: MODELING AND ABSTRACTION
Complex and evolving logic is moving from hardware to firmware for better controllability. Challenges:
Firmware algorithms are compute-intensive and often developed late in the design cycle.
The load and execution time of the software adds a further challenge to verification.
Connected Standby Verification Approach
Model the relevant section of the software using a BFM with a proper interface to the hardware.
Add sufficient controllability to stress different paths of the BFM model; track coverage.
Adapt stimulus based on coverage of the BFM/state machine.
Goals: Improve quality, reduce development time.
21. ADAPTIVE STIMULUS
Typically, power-management transitions kick off after active code execution stops. This results in deep corner cases associated with thread-level coordination in a multi-core design, and predicting occurrences of these deeper phases and targeting them with code/stimulus is difficult.
Define the power-management modes as state machines, each state having granular phases including thread-specific information.
A dynamic irritator monitors these state transitions, inserts random/directed asynchronous events (different sorts of interrupts, probes, warm reset) and updates a scoreboard.
Events are generated very close to the relevant points, which provides great controllability.
The dynamic irritator adapts based on scoreboard statistics, eventually putting more weight on the less frequently covered <state> × <event> buckets.
Goals: Improve quality, reduce development time.
22. CONSTRAINT RANDOM STIMULUS AND RANDOMIZATION AT SOC
A complex SoC requires randomization at different levels (Ref [3]):
SoC constraints and IP constraints
Register and fuse values
Modes: LFBR, BfD, long_init/unfused test
[Flow: build the randomization utility with package-level info → RandomConfig executable → at time t=0, import config values after reset via command-line options → run, starting from random initial states (S11, S21, S23, …)]
Goals: Improve quality, reduce development time.
23. OVERCOMING LIMITATIONS OF GATE-LEVEL SIMULATION
Challenges with netlist simulation:
Longer run-times
Longer debug times
Approach to minimize runtime: compute-intensive RTL and the associated verification components are replaced with a less intensive test-vector applicator that applies test vectors directly from an FSDB file.
Flow: run the RTL simulation and capture the FSDB → create gatesim files (gatesim.v, forces.v) → build the netlist + gatesim files + a TB that drives stimulus from the FSDB → run netlist sims (with FSDB dump).
This gives a 10x runtime improvement over the traditional approach.
Approach to minimize debug effort: a Verdi NPI-based methodology to automate debug. Ref [5]
Goals: Improve quality, reduce development time.
25. REFERENCES
[1] A New Parallel Computing Platform – HSA. Keynote speech, CTHPC 2013. Roy Ju, AMD Senior Fellow.
[2] AMD APUs: Dynamic Power Management Techniques. DAC 2013. Praveen Dongara, System Architect.
[3] Wilson Research Group / MGC, 2013.
[4] Kaveri DTP. Internal document.
[5] Innovative Approach to Overcome Limitations of Netlist Simulation. SNUG 2013. Prodip K., Pankaj S., Meera M., Narendran K.
26. GLOSSARY
GPU – Graphics Processing Unit
APU – Accelerated Processing Unit
OpenCL™ – Open Computing Language
TDP – Thermal Design Power: the average thermal dissipation power a design's cooling infrastructure must be able to handle
AMD Turbo Core Technology – AMD boost mechanism
BAPM – Bi-directional Application Power Management
CAC – Capacitance AC switching; measures the switching activity of a cluster
Pstate – Processor performance state
GARLIC – Graphics Accelerated Reduced Latency Integrated Channel
ONION – On-chip Northbridge to I/O Non-coherent bus
FSDB – Fast Signal Database
28. DYNAMIC FINE-GRAINED POWER TRANSFERS
The dynamically calculated temperature of each core and the GPU enables the operating point of each to be dynamically balanced in order to maximize performance within temperature limits. Low activity in one core enables it to act as a thermal sink for a more active core.
[Charts: per-core temperature (75–100 °C) under GPU-centric, balanced, and CPU-centric power-transfer scenarios — Ref [2]]