Platform Coherency and SoC Verification Challenges
PANKAJ SINGH, CHETHAN-RAJ M, PRAKASH RAGHAVENDRA, ANINDYASUNDAR NANDI, DIBYENDU DAS AND TONY TYE
THE 11TH INTERNATIONAL SYSTEM-ON-CHIP (SOC) CONFERENCE, EXHIBIT, AND WORKSHOPS, OCTOBER 2013, IRVINE, CALIFORNIA
WWW.SOCCONFERENCE.COM
ACKNOWLEDGEMENTS:
PHIL ROGERS (AMD CORPORATE FELLOW), ROY JU & BEN SANDER (SR FELLOW),
NARENDRA KAMAT, PRAVEEN DONGARA AND LEE HOWES
TODAY’S TOPICS
1. A New Parallel Computing Platform – Heterogeneous System Architecture
   – Opportunities, Benefits and Feature Roadmap
2. Kaveri Platform Coherency
   – Shared memory, Platform atomics
3. Kaveri Verification Approach
4. SoC Verification Challenges and Solutions
| 11th Intl. SoC Conference| Oct 23rd,24th, 2013

A New Parallel Computing Platform – Heterogeneous System Architecture (HSA)

APU: ACCELERATED PROCESSING UNIT

The APU is a great advance compared to previous platforms.

Combines scalar processing on the CPU with parallel processing on the GPU and
high-bandwidth access to memory.

[Figure: APU with a CPU pair and GPU SIMD units]

Challenge: How do we make it even better going forward?
• Easier to program
• Easier to optimize
• Easier to load balance
• Higher performance
• Lower power
THE HSA OPPORTUNITY ON MODERN APPLICATIONS

PROBLEM
• Historically, developers program CPUs
• GPU/HW blocks hard to program
• Not all workloads accelerate

SOLUTION
• HSA + Libraries = productivity & performance with low power

[Figure: Developer Return (differentiation in performance, reduced power, features, time to market) versus Developer Investment (effort, time, new skills)]
• CPU coders: ~20+M* coders, ~4M apps, good user experiences
• GPU coders: tens of Ks of coders, a few hundred apps, significant niche value
• HSA coders: a few M coders, a few 100Ks of HSA apps, wide range of differentiated experiences
*IDC
HSA AND ITS BENEFITS
HSA IS A COMPUTING PLATFORM THAT DRIVES A NEW CLASS OF APPLICATIONS
App-Accelerated Software Applications
Graphics Workloads
Data-Parallel Workloads
Serial and Task-Parallel Workloads
HSA is an enabler of the APU’s higher performance and power efficiency
Our industry-leading APUs speed up applications beyond graphics
The CPU and GPU in an APU work cooperatively, directly in system memory
Makes programming the APU as easy as C++
Improves performance per watt

Ref [1]
HSA EFFICIENCY IMPROVEMENT (AN EXAMPLE)
Improves Power and Performance: Move application from CPU to GPU, remove data copies,
and reduce launch time
[Figure: ENERGY COMPUTATION BREAKDOWN: MOTIONDSP 720P VIDEO CLEAN-UP. Measured power (0 W to 35 W, split into DRAM, NB+GPU, and CPU cores) and measured performance (0 fps to 25 fps), each shown for CPU versus CPU+GPU.]

Simulate removing memory copies: 1.32X
• 1.11 * 2.88 * 1.32 = 4.22X better energy efficiency
• Easier to program + remove copies

Ref [1]
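The 4.22X figure on this slide is the product of three improvement factors. A minimal sketch of that arithmetic follows, assuming the factors are independent and simply multiply. The function name `combined_gain` is hypothetical, and the slide labels only the 1.32 factor (simulated removal of memory copies).

```c
#include <assert.h>
#include <stddef.h>

/* Multiply independent improvement factors into one combined gain.
   Assumption: the factors compose multiplicatively, as on the slide. */
double combined_gain(const double *factors, size_t n)
{
    assert(factors != NULL);
    double gain = 1.0;
    for (size_t i = 0; i < n; ++i) {
        gain *= factors[i];
    }
    return gain;
}
```

Called with the slide's factors {1.11, 2.88, 1.32}, the product is about 4.22, matching the quoted energy-efficiency improvement.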
HETEROGENEOUS SYSTEM ARCHITECTURE FEATURE ROADMAP

Physical Integration
• Integrate CPU & GPU in silicon
• Unified Memory Controller
• Common Manufacturing Technology

Optimized Platforms
• GPU Compute C++ support
• User Mode Scheduling
• Bi-Directional Power Mgmt between CPU and GPU

Architectural Integration
• Unified Address Space for CPU and GPU
• GPU uses pageable system memory via CPU pointers
• Fully coherent memory between CPU & GPU

System Integration
• GPU compute context switch
• GPU graphics pre-emption
• Quality of Service
PLATFORM COHERENCY

KAVERI SOC – ENABLING SHARED MEMORY AND PLATFORM ATOMICS
Shared memory accesses between the CPU and GPU happen via ‘system memory’.
– Corresponds to the notion of shared virtual memory (SVM) in OpenCL 2.0, available via the clSVMAlloc() call. With SVM, CPUs and GPUs can share an address space and share pointers to the same memory location.
– The compiler supports SVM and atomics calls that work across the CPU-GPU boundary.
– System-memory accesses may take one of three paths:
  • If coherence with the CPU is not required: GARLIC path
  • If kernel-granularity coherence with the CPU is required: ONION bus path
  • If instruction-granularity coherence with the CPU is required: bypass L2 via the ONION+ bus (required by atomics)
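The three-way path choice above can be written as a small decision function. This is an illustrative sketch only: the enum and function names are hypothetical, and real path selection is done by hardware and driver settings rather than by a call like this.

```c
#include <assert.h>

/* Coherence a GPU access to system memory needs with the CPU. */
enum coherence_need {
    COHERENCE_NONE,        /* no coherence with the CPU required */
    COHERENCE_KERNEL,      /* coherent at kernel granularity */
    COHERENCE_INSTRUCTION  /* coherent per instruction (platform atomics) */
};

/* The three system-memory paths named on the slide. */
enum memory_path { PATH_GARLIC, PATH_ONION, PATH_ONION_PLUS };

/* Map the required coherence granularity to a memory path. */
enum memory_path select_path(enum coherence_need need)
{
    assert(need == COHERENCE_NONE || need == COHERENCE_KERNEL ||
           need == COHERENCE_INSTRUCTION);
    switch (need) {
    case COHERENCE_NONE:
        return PATH_GARLIC;
    case COHERENCE_KERNEL:
        return PATH_ONION;
    default:
        return PATH_ONION_PLUS;  /* bypasses L2, required by atomics */
    }
}
```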

CONCURRENT STACK PUSH USING ATOMIC COMPARE-AND-EXCHANGE (AN EXAMPLE)
Each CPU thread and each GPU workitem executes the following code concurrently:
• The code shows an example implementation of a concurrent stack’s “push” operation.
• “compare_exchange_strong” is an atomic call that ensures only one CPU thread or GPU workitem succeeds in updating the “head” pointer of the stack, which is stored in list[0].

do {
    head = list[0];  // redundant after the first pass: on failure the atomic call writes the current list[0] into head
    list[i] = head;  // link the new element i to the current head
} while (!atomic_compare_exchange_strong(&list[0], &head, i));
[Figure: contents of list[] (indices 0 to 5, ..., 99) before and after the push]
• i=2 and i=4 contest for ACE (atomic compare-and-exchange). List: 3 (head) -> 5 -> -1
• List after i=2 wins: 2 (head) -> 3 -> 5 -> -1

Time instant          Workitem i=2         Workitem i=4
Before ACE            head=3, list[2]=3    head=3, list[4]=3
ACE                   Wins!                Loses, goes back and retries
After ACE completes   list[0]=2            list[0]=2
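For readers who want to try the push loop on a CPU, here is a self-contained C11 sketch using `<stdatomic.h>`. It assumes the same layout as the slide (list[0] is the head index, list[i] is the index of the element below i, and -1 ends the list). The names `stack_push` and `STACK_END` are hypothetical. It hoists the initial load out of the loop, since a failed compare-exchange already refreshes `head`.

```c
#include <assert.h>
#include <stdatomic.h>

#define STACK_END (-1)  /* terminator used in the slide's example list */

/* Push element i (i >= 1) onto the index-linked stack stored in list[]. */
void stack_push(atomic_int *list, int i)
{
    assert(i >= 1);
    int head = atomic_load(&list[0]);
    do {
        /* Link the new element to the head value we currently hold. */
        atomic_store(&list[i], head);
        /* On failure, head is overwritten with the current list[0],
           so the next pass relinks against the fresh value. */
    } while (!atomic_compare_exchange_strong(&list[0], &head, i));
}
```

Starting from an empty stack (list[0] = -1) and pushing 5, 3, and then 2 reproduces the slide's final list: 2 (head) -> 3 -> 5 -> -1.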
IMPLEMENTING PLATFORM ATOMICS FOR KAVERI
• The compiler has implemented these atomics (per the OpenCL 2.0 standard) for Kaveri.
• The key issue in implementing these atomics is making sure that both the CPU and the GPU see shared memory in a “coherent” state.
• Coherency is implemented using the ONION+ memory path and GPU ISA instructions that can selectively invalidate/bypass the L1/L2 caches from the GPU side and snoop to invalidate the CPU caches. This support is provided in the KV SoC.
• For example, atomic_load with acquire semantics and atomic_store with release semantics generate the following GPU ISA (in Kaveri, L2 is always bypassed for coherent access).

atomic_load with acquire semantics:
1. load with glc=1      // bypass the L1 cache
2. s_waitcnt 0          // wait for the load to complete
3. buffer_wbinv_vol     // invalidate L1 so that any following load reads from memory

atomic_store with release semantics:
1. s_waitcnt 0          // wait for any previous memop to complete
2. store with glc=0     // L1 is a write-through cache, so the write goes to memory as L2 is bypassed
3. s_waitcnt 0          // prevent any following memop from moving up

• OpenCL 2.0 and C11 atomics support various kinds of memory_scope and memory_order
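At the source level, the acquire load and release store discussed above correspond to the C11 calls `atomic_load_explicit` and `atomic_store_explicit`. The sketch below shows the usual publish/consume pairing on a CPU. The `struct message` type and both function names are hypothetical, and the code says nothing about the GPU instructions a Kaveri compiler would emit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct message {
    int        payload;  /* ordinary data, written before the flag */
    atomic_int ready;    /* flag published with release semantics */
};

/* Producer: write the payload, then publish it with a release store. */
void message_publish(struct message *m, int value)
{
    assert(m != NULL);
    m->payload = value;
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Consumer: an acquire load of the flag orders the later payload read.
   Returns 1 and fills *out if a value was published, otherwise 0. */
int message_try_read(struct message *m, int *out)
{
    assert(m != NULL && out != NULL);
    if (atomic_load_explicit(&m->ready, memory_order_acquire) == 0) {
        return 0;
    }
    *out = m->payload;
    return 1;
}
```

The release store keeps the payload write from moving after the flag write, and the acquire load keeps the payload read from moving before the flag read, which is the same ordering the two ISA sequences enforce with s_waitcnt and the L1 invalidate.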
KAVERI SOC VERIFICATION
APPROACH

TRADITIONAL VERIFICATION AND SOC CHALLENGE
[Figure: Traditional test bench with CPU, NorthBridge, DRAM model, graphics model (GFX), and SouthBridge BFM]

CPU-BASED VERIFICATION
• Assembly-based input
• Memory image of x86 machine code is preloaded into the DRAM model
• CPU fetches instructions from DRAM and executes them

GPU-BASED VERIFICATION
• Higher-level language (C/C++)
• BFM model used across a PCIe-based interface to inject data
• GPU sends requests to DRAM over 2 paths: coherent and non-coherent

SoC Verification Challenge
• Added layer of complexity due to the HSA coherency environment.
• The SoC GPU needs to be programmed, which requires a host.
• The SoC CPU can be used as the host. However, running the same host software stack results in huge simulation time.
• One approach is Mailbox:
  – Inefficient due to lack of CPU-GPU interaction, longer run time.
  – GPU-focused verification is not suitable for CPU-GPU interaction (HSA)
SOC VERIFICATION METHODOLOGY: TEST FLOW
Running driver code on a simulated CPU is impossible due to simulation run-times.

Intent Capture is a mechanism that allows existing discrete GPU graphics tests to execute on the CPU in a heterogeneous APU simulation.

[Figure: Test flow with the following elements: GPU test (OpenCL), Intent Capture, capture output, Replay(), CPU test with one thread (driver CPU) and other threads, sp3 shader, CX Shell, .sim memory image, APU RTL sim, test output]

• The memory accesses and configuration writes from the test are extracted into C function calls.
• Intent Capture performs this activity and encapsulates the GPU test into a function called Replay.
• On the CPU side, one thread runs the Replay function while other threads execute the CPU side of the test.
• The composite test (CPU test + generated FusionReplay function) is compiled using cxshell to generate a .sim memory image.

Ref [4]
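The capture-then-replay idea in the test flow can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the types, the function names `capture` and `replay`, and the 16-word memory and register arrays are hypothetical and do not reflect the internal Intent Capture tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum op_kind { OP_MEM_WRITE, OP_CFG_WRITE };

struct captured_op {
    enum op_kind kind;
    size_t       addr;  /* word index in the toy target below */
    uint32_t     data;
};

#define MAX_OPS      64
#define TARGET_WORDS 16

struct capture_log {
    struct captured_op ops[MAX_OPS];
    size_t             count;
};

/* Toy stand-in for the simulated APU: memory plus config registers. */
struct target {
    uint32_t mem[TARGET_WORDS];
    uint32_t cfg[TARGET_WORDS];
};

/* Capture phase: record one access issued by the GPU test. */
void capture(struct capture_log *log, enum op_kind kind,
             size_t addr, uint32_t data)
{
    assert(log != NULL && log->count < MAX_OPS && addr < TARGET_WORDS);
    struct captured_op op = { kind, addr, data };
    log->ops[log->count++] = op;
}

/* Replay phase: reissue every captured access in its original order. */
void replay(const struct capture_log *log, struct target *t)
{
    assert(log != NULL && t != NULL);
    for (size_t i = 0; i < log->count; ++i) {
        const struct captured_op *op = &log->ops[i];
        if (op->kind == OP_MEM_WRITE) {
            t->mem[op->addr] = op->data;
        } else {
            t->cfg[op->addr] = op->data;
        }
    }
}
```

In the flow described above, the replay step would run on one CPU thread of the composite test while other threads execute the CPU side of the test.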
POWER MANAGEMENT: BAPM

[Figure: Multiple boost P-states (Pb0 ... Pbx) above P0/Pbase and P1 in the HW view, with SWP0, SWP1, ... in the SW/OS view. Bars compare core power at Pbase with the power of the rest of the APU for three example applications (App1 to App3) with low, medium, and high CAC and with all or half of the cores active; the plot also tracks die temperature and APU power.]

ILLUSTRATION WITH CPU-CENTRIC SCENARIO (Ref [2])
[Figure: CPU Core1, CPU Core2, GPU Core1, GPU Core2]
• Compute Unit Power Monitor calculates CPU power
• GPU Power Monitor calculates GPU power
• Firmware converts power into temperature estimates
• Compare temperature to limit and adjust voltage/frequency
  – If Temp > Limit, reduce power allocation
  – If Temp < Limit, increase power allocation

• In a multi-core design, apps running on CPU/GPU cores may consume less power than the thermal limit allows
• Power-efficient algorithms exploit this power headroom for performance
• The GPU can borrow power credit from the CPU in GPU-centric scenarios and vice versa
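The compare-and-adjust loop on this slide can be sketched in a few lines. This is not AMD's firmware algorithm: the linear power-to-temperature model, the fixed adjustment step, and the names `estimate_temp_c` and `adjust_allocation_w` are all assumptions chosen to keep the example small.

```c
#include <assert.h>

/* Assumed first-order thermal model: temperature rises linearly with
   power above ambient. c_per_w is a hypothetical thermal resistance. */
double estimate_temp_c(double power_w, double ambient_c, double c_per_w)
{
    assert(power_w >= 0.0 && c_per_w > 0.0);
    return ambient_c + power_w * c_per_w;
}

/* One control step: compare the estimated temperature with the limit
   and move the power allocation by step_w in the right direction. */
double adjust_allocation_w(double alloc_w, double temp_c,
                           double limit_c, double step_w)
{
    assert(step_w > 0.0);
    if (temp_c > limit_c) {
        double lowered = alloc_w - step_w;
        return lowered > 0.0 ? lowered : 0.0;  /* never go negative */
    }
    if (temp_c < limit_c) {
        return alloc_w + step_w;  /* headroom available: allow a boost */
    }
    return alloc_w;
}
```

Running the step once per power-monitor sample gives the behaviour in the flow above: the allocation shrinks while the estimate is over the limit and grows again when there is headroom.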
BAPM VERIFICATION APPROACH @ SOC
[Figure: CPU Core1 and CPU Core2 with CPU power monitors, GPU Core1 and GPU Core2 with GPU power monitors, NB, CAC manager, and SMU F/W]

• Developed high and low power consuming CPU patterns based on micro-architecture and power analysis.
• Interleaved high and low power patterns in random stimulus.
• Used an irritator to manipulate the credits sent to the CAC manager at times to hit corner cases like back-to-back boost/throttle.

• Modeled the F/W algorithm using a simple BFM.
• Added a CSR framework to drive reads/writes to the CAC manager.
• A few sanity tests were run with real F/W loaded through backdoor to check the end-to-end flow.

• Used irritators to model GPU power credit reporting instead of running GPU applications.
• GPU power monitor verified at GPU IP level.

Efficient coverage-driven random verification
• CPU boosted because of GPU giving away credits, and vice versa
• Crosses of CPU/GPU events and effect on BAPM
SOC VERIFICATION
CHALLENGES & SOLUTION

TEST STIMULUS REUSE AND PORTING TO SOC
Tool and flow differences/set-up across IP and SoC make stimulus reuse difficult.

Using a functional model to simulate IP [RTL] in an SoC scenario, for IP test development and easy porting to SoC
[Figure: cMemory memory model, MPMM, MEMIO memory model, GPU C model, CPU C model/RTL bus unit, CPU to GPU access]
• A simple HSA SoC test with 1 Rd-WR takes about 18 hours in RTL, whereas it takes <1 hour on the heterogeneous C model.

Intent Capture and Playback methodology
[Figure: DV test, GPU C model, capture output, replay, APU, test output]

Test setup update @ IP level to support test run with SoC as a new target
[Figure: Flow elements: export suite and test key; IP2SoC script; create job spec [ip2soc –merge]; run/execute regression; common test options; UNB perf options; NB/DCT prog. options; memory config; Perf_options.yml; run_job command-line options (GNB, XNB, UNB); directories, reports, sim output]
• Test setup updates, such as configuration changes and test stimulus defines, allowed IP tests to be reused.

Goal: Improve Quality, Reduce development time
HW-SW INTERACTION: MODELING AND ABSTRACTION
Complex and evolving logic is moving from hardware to firmware for better controllability. Challenges:
• Firmware algorithms are compute-intensive and often developed late in the design cycle.
• Additional challenge to verification in terms of load and execution time of the software.

Connected Standby Verification Approach
• Model the relevant section of the software using a BFM with a proper interface to the hardware
• Add sufficient controllability to stress different paths of the BFM model and find coverage
• Adaptive stimulus based on coverage of the BFM/state machine

Goals: Improve Quality, Reduce development time
ADAPTIVE STIMULUS
Typically, power management transitions kick off after active code execution stops. This results in deeper corner cases associated with thread-level coordination in a multi-core design.
• Predicting occurrences of deeper phases and targeting them with code/stimulus is difficult.

• Define the power management modes as state machines, with each state having granular phases that include thread-specific information.
• A dynamic irritator monitors these state transitions, inserts random/directed asynchronous events (such as different sorts of interrupts, probes, warm reset) and updates a scoreboard.
• Events are generated very close to the relevant points, which provides great controllability.
• The dynamic irritator adapts based on scoreboard statistics, eventually giving more weight to the less frequently covered <state> X <event> buckets.
Goals: Improve Quality, Reduce development time
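One way to realize "more weight to the less frequently covered buckets" is an inverse-frequency weighted choice over the scoreboard. The sketch below assumes one possible weighting (max hits in the state minus the bucket's hits, plus one). The slide does not specify the actual weighting scheme, and the sizes, type, and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_STATES 3
#define NUM_EVENTS 4

/* Hit counts per <state> X <event> bucket. */
struct scoreboard {
    unsigned hits[NUM_STATES][NUM_EVENTS];
};

/* Buckets that were hit less often in this state get a larger weight.
   The result is always at least 1, so every event stays reachable. */
static unsigned bucket_weight(const struct scoreboard *sb, int state, int event)
{
    unsigned max_hits = 0;
    for (int e = 0; e < NUM_EVENTS; ++e) {
        if (sb->hits[state][e] > max_hits) {
            max_hits = sb->hits[state][e];
        }
    }
    return max_hits - sb->hits[state][event] + 1;
}

/* Choose the event to inject in `state` and record it in the scoreboard.
   random_value is any unsigned number from the caller's generator. */
int pick_event(struct scoreboard *sb, int state, unsigned random_value)
{
    assert(sb != NULL && state >= 0 && state < NUM_STATES);
    unsigned total = 0;
    for (int e = 0; e < NUM_EVENTS; ++e) {
        total += bucket_weight(sb, state, e);
    }
    unsigned r = random_value % total;
    for (int e = 0; e < NUM_EVENTS; ++e) {
        unsigned w = bucket_weight(sb, state, e);
        if (r < w) {
            sb->hits[state][e]++;
            return e;
        }
        r -= w;
    }
    return NUM_EVENTS - 1;  /* not reached, because r < total */
}
```

With hit counts {5, 5, 0, 5} in a state, the weights become {1, 1, 6, 1}, so the uncovered event is chosen for six of every nine random values.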
CONSTRAINED-RANDOM STIMULUS AND RANDOMIZATION AT SOC
Complex SoC requires randomization at different levels:
• SoC constraints
• IP constraints
• Register
• Fuse
• Modes: LFBR, BfD, long_init/unfused test

[Figure: Random initial states (S11, S21, S23, ...). Ref [3]]
[Figure: Randomization flow with build, randomization utility, package-level info, RandomConfig executable, CMD line options, and run; config values at time t=0 and values imported after reset]

Goals: Improve Quality, Reduce development time
OVERCOMING LIMITATIONS OF GATE-LEVEL SIMULATION
Challenges with netlist simulation:
• Longer run-times
• Longer debug times

• Approach to minimize runtime: compute-intensive RTL and associated verification components must be replaced with a less intensive test-vector applicator, which applies test vectors directly from an FSDB file.
  [Figure: Flow steps: create gatesim files (gatesim.v, forces.v); run RTL simulation and get FSDB; build with netlist + gatesim files + TB to drive stimulus from FSDB; run netlist sims (with FSDB dump)]
  10x runtime optimization over the traditional approach.
• Approach to minimize debug effort: Verdi NPI-based methodology to automate debug (Ref [5])

Goals: Improve Quality, Reduce development time
THANK YOU
REFERENCES
[1] Roy Ju (AMD Senior Fellow), "A New Parallel Computing Platform – HSA", CTHPC 2013 keynote speech.
[2] Praveen Dongara (System Architect), "AMD APUs: Dynamic Power Management Techniques", DAC 2013.
[3] Wilson Research Group-MGC 2013.
[4] Kaveri DTP. Internal document.
[5] Prodip K, Pankaj S, Meera M, Narendran K, "Innovative Approach to Overcome Limitations of Netlist Simulation", SUNG 2013.
GLOSSARY
• GPU -- Graphics Processing Unit
• APU -- Accelerated Processing Unit
• OpenCL™ -- Open Computing Language
• TDP -- Thermal Design Power; a measure of a design infrastructure’s ability to cool a device, representing the average thermal dissipation power required to cool the design
• AMD Turbo Core Technology -- AMD boost mechanism
• BAPM -- Bi-directional Application Power Management
• Cac -- Capacitance AC switching; measures switching activity of a cluster
• Pstate -- Processor performance state
• GARLIC -- Graphic Accelerated Reduced Latency Integrated Channel
• ONION -- On-chip Northbridge to I/O Noncoherent bus
• FSDB -- Fast Signal Database
BACKUP

DYNAMIC FINE-GRAINED POWER TRANSFERS
The dynamically calculated temperature of each core and the GPU enables the operating point of each to be dynamically balanced in order to maximize performance within temperature limits.

Low activity in one core enables it to be a thermal sink for a more active core.

[Figure: Temperature maps (scale 75.0 to 100.0) for GPU-centric, balanced, and CPU-centric scenarios. Ref [2]]
Disclaimer
The information presented in this document is for informational purposes only and may contain technical inaccuracies,
omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not
limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases,
product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD
assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this
information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD makes no representations or warranties with respect to the contents hereof and assumes no responsibility for any
inaccuracies, errors or omissions that appear in this information.
AMD specifically disclaims any implied warranties of merchantability or fitness for any particular purpose. In no event will AMD
be liable to any person for any direct, indirect, special or other consequential damages arising from the use of any information
contained herein, even if AMD is expressly advised of the possibility of such damages.

Trademark Attribution
AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United
States and/or other jurisdictions. OpenCL and the OpenCL logo are trademarks of Apple, Inc. and used by permission by
Khronos. Microsoft, Windows and DirectX are registered trademarks of Microsoft Corporation in the United States and/or other
jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their
respective owners.
©2011 Advanced Micro Devices, Inc. All rights reserved.

Power Optimization with Efficient Test Logic Partitioning for Full Chip Design
 

Dernier

Dernier (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

AMD_11th_Intl_SoC_Conf_UCI_Irvine

  • 1. Platform Coherency and SoC Verification Challenges. PANKAJ SINGH, CHETHAN-RAJ M, PRAKASH RAGHAVENDRA, ANINDYASUNDAR NANDI, DIBYENDU DAS AND TONY TYE. THE 11TH INTERNATIONAL SYSTEM-ON-CHIP (SOC) CONFERENCE, EXHIBIT, AND WORKSHOPS, OCTOBER 2013, IRVINE, CALIFORNIA. WWW.SOCCONFERENCE.COM. ACKNOWLEDGEMENTS: PHIL ROGERS, AMD CORPORATE FELLOW; ROY JU & BEN SANDER, SR FELLOW; NARENDRA KAMAT, PRAVEEN DONGARA AND LEE HOWES.
  • 2. TODAY’S TOPICS. (1) A New Parallel Computing Platform – Heterogeneous System Architecture (HSA): opportunities, benefits and feature roadmap. (2) Kaveri Platform Coherency: shared memory, platform atomics. (3) Kaveri Verification Approach. (4) SoC Verification Challenges and Solutions.
  • 3. A New Parallel Computing Platform – Heterogeneous System Architecture (HSA)
  • 4. APU: ACCELERATED PROCESSING UNIT. The APU is a great advance compared to previous platforms: it combines scalar processing on the CPU with parallel processing on the GPU and high-bandwidth access to memory. [Diagram labels: CPU pair, GPU SIMD.] Challenge: how do we make it even better going forward? Easier to program; easier to optimize; easier to load balance; higher performance; lower power.
  • 5. THE HSA OPPORTUNITY ON MODERN APPLICATIONS. [Chart: Developer Return (differentiation in performance, reduced power, features, time to market) versus Developer Investment (effort, time, new skills).] Problem: historically, developers program CPUs: ~20+M* CPU coders, ~4M apps, good user experiences. Problem: GPU/HW blocks are hard to program and not all workloads accelerate: tens of Ks GPU coders, few hundred apps, significant niche value. Solution: HSA + libraries = productivity and performance with low power: few M HSA coders, few 100Ks HSA apps, wide range of differentiated experiences. (*IDC)
  • 6. HSA AND ITS BENEFITS. HSA is a computing platform that drives a new class of applications. App-accelerated software applications cover graphics workloads, data-parallel workloads, and serial and task-parallel workloads. HSA is an enabler of the APU's higher performance and power efficiency: our industry-leading APUs speed up applications beyond graphics; CPU and GPU (APUs) work cooperatively together directly in system memory; HSA makes programming the APU as easy as C++; it improves performance per watt. Ref [1]
  • 7. HSA EFFICIENCY IMPROVEMENT (AN EXAMPLE). Improves power and performance: move the application from CPU to GPU, remove data copies, and reduce launch time. [Chart: measured power (0 W to 35 W, split into CPU cores, NB+GPU and DRAM) and measured performance (0 fps to 25 fps) for CPU versus CPU+GPU.] Simulate removing memory copies: 1.32X. CPU+GPU: 1.11 * 2.88 * 1.32 = 4.22X better energy efficiency. Easier to program + remove copies. Energy computation breakdown: MotionDSP 720p video clean-up. Ref [1]
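The energy-efficiency figure on slide 7 is a product of three factors, and the arithmetic can be checked directly. Only the 1.32X factor is labeled in the transcript (simulated removal of memory copies); the roles of 1.11 and 2.88 are not recoverable from the text, so this worked step makes no assumption about them.

```latex
\[
1.11 \times 2.88 = 3.1968, \qquad
3.1968 \times 1.32 = 4.219776 \approx 4.22
\]
```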
  • 8. HETEROGENEOUS SYSTEM ARCHITECTURE FEATURE ROADMAP. Physical Integration: integrate CPU & GPU in silicon; unified memory controller; common manufacturing technology. Optimized Platforms: GPU compute C++ support; user mode scheduling; bi-directional power management between CPU and GPU. Architectural Integration: unified address space for CPU and GPU; GPU uses pageable system memory via CPU pointers; fully coherent memory between CPU & GPU. System Integration: GPU compute context switch; GPU graphics pre-emption; quality of service.
  • 9. PLATFORM COHERENCY
  • 10. KAVERI SOC – ENABLING SHARED MEMORY AND PLATFORM ATOMICS. Shared memory accesses between the CPU and GPU happen via ‘system memory’. This corresponds to the notion of shared virtual memory (SVM) in OpenCL 2.0, available via the clSVMAlloc() call. With SVM, CPUs and GPUs can share an address space and share a pointer to the same memory location. The compiler supports SVM and atomics calls that work across the CPU-GPU boundary. System-memory accesses take one of three paths: if coherence with the CPU is not required, the GARLIC path; if kernel-granularity coherence with the CPU is required, the ONION bus path; if instruction-granularity coherence with the CPU is required, bypass L2 via the ONION+ bus (required by atomics).
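The three-way path choice on slide 10 can be sketched as a small lookup. This is only an illustration of the rule stated on the slide: the enum and function names are hypothetical, and on real hardware the path is selected by the driver, compiler and memory attributes rather than by a function like this.

```c
/* Coherence a GPU access to system memory needs with respect to the CPU.
 * All names here are illustrative; they do not come from the slides. */
typedef enum {
    COHERENCE_NONE,        /* no coherence with the CPU required */
    COHERENCE_KERNEL,      /* coherent at kernel granularity */
    COHERENCE_INSTRUCTION  /* coherent at instruction granularity (atomics) */
} coherence_t;

typedef enum { PATH_GARLIC, PATH_ONION, PATH_ONION_PLUS } mem_path_t;

/* Map the required coherence level to the path named on the slide. */
mem_path_t select_path(coherence_t need)
{
    switch (need) {
    case COHERENCE_NONE:        return PATH_GARLIC;
    case COHERENCE_KERNEL:      return PATH_ONION;
    case COHERENCE_INSTRUCTION: return PATH_ONION_PLUS; /* bypasses GPU L2 */
    }
    return PATH_ONION_PLUS; /* conservative default for unknown input */
}
```

Defaulting to the most coherent path for unexpected input is a design choice of this sketch, trading bandwidth for safety.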
  • 11. CONCURRENT STACK PUSH USING ATOMIC COMPARE-AND-EXCHANGE (AN EXAMPLE). Each CPU thread and each GPU workitem executes the following code concurrently. The code shows an example implementation of a concurrent stack’s “push” operation. “compare_exchange_strong” is an atomic call that ensures only one CPU thread or GPU workitem succeeds in updating the “head” pointer of the stack, which is stored in list[0]. Code: do { head = list[0]; /* redundant because the atomic call updates head on failure */ list[i] = head; } while (!atomic_compare_exchange_strong(&list[0], &head, i)); Example: workitems i=2 and i=4 contest for the atomic compare-and-exchange (ACE) on the list 3 (head) -> 5 -> -1. Before the ACE, workitem 2 has head=3, list[2]=3 and workitem 4 has head=3, list[4]=3. Workitem 2 wins the ACE; after it completes, list[0]=2 and the list is 2 (head) -> 3 -> 5 -> -1. Workitem 4 loses, sees list[0]=2, and goes back and retries.
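A host-side sketch of the slide 11 push loop, written with C11 `<stdatomic.h>` so it compiles on a CPU. It assumes the slide's layout (list[0] holds the head index, list[i] holds the index of the node below node i, -1 terminates the list). The array size and the helper names `stack_init` and `stack_push` are illustrative, and the OpenCL 2.0 kernel version would use the OpenCL atomic types instead.

```c
#include <stdatomic.h>

#define STACK_SLOTS 100   /* illustrative size; slot 0 is the head pointer */
#define STACK_END   (-1)  /* terminator used in the slide's example */

/* list[0] is the head index; list[i] is the "next" index of node i. */
static atomic_int list[STACK_SLOTS];

void stack_init(void)
{
    atomic_store(&list[0], STACK_END);
}

/* Push node i (1 <= i < STACK_SLOTS). Safe to call from many threads:
 * the compare-exchange publishes i as the new head only if the head is
 * still the value this caller linked its node to. */
void stack_push(int i)
{
    int head = atomic_load(&list[0]);
    do {
        /* Link the new node to the head value we last observed. */
        atomic_store(&list[i], head);
        /* On failure, compare_exchange writes the current head into
         * `head`, so the loop does not need to reload it. */
    } while (!atomic_compare_exchange_strong(&list[0], &head, i));
}
```

Reading the head once before the loop, rather than on every iteration, relies on the failure behaviour the slide's comment points out.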
  • 12. IMPLEMENTING PLATFORM ATOMICS FOR KAVERI. The compiler implements these atomics (per the OpenCL 2.0 standard) for Kaveri. The key issue in implementing them is making sure that both the CPU and the GPU see the shared memory in a “coherent” state. Coherency is implemented using the ONION+ memory path and GPU ISA instructions that can selectively invalidate/bypass the L1/L2 caches from the GPU side and snoop to invalidate the CPU caches. This support is provided in the KV SoC. For example, atomic_load with acquire semantics generates the following GPU code (on Kaveri, L2 is always bypassed for coherent accesses): (1) load with glc=1 (bypass the L1 cache); (2) s_waitcnt 0 (wait for the load to complete); (3) buffer_wbinv_vol (invalidate L1 so that any following load reads from memory). Similarly, atomic_store with release semantics generates: (1) s_waitcnt 0 (wait for any previous memop to complete); (2) store with glc=0 (L1 is a write-through cache, so the write goes to memory because L2 is bypassed); (3) s_waitcnt 0 (prevent any following memop from moving up). OpenCL 2.0 atomics, like C11 atomics, support several memory orders (memory_order); OpenCL 2.0 additionally defines memory scopes (memory_scope).
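At the source level, the acquire load and release store discussed on slide 12 correspond to the C11 calls below. This is a minimal CPU-side sketch of the usual publish/consume pattern, not the Kaveri compiler's output. The names `payload`, `ready`, `publish` and `try_consume` are hypothetical, and C11 has no equivalent of OpenCL's memory_scope argument.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;       /* ordinary data handed from producer to consumer */
static atomic_int ready;  /* flag; zero-initialized, so "not ready" at start */

/* Producer: write the data, then publish it with a release store.
 * The release ordering keeps the payload write from moving below the flag. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: an acquire load of the flag; if it observes 1, the payload
 * written before the matching release store is guaranteed to be visible. */
bool try_consume(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        return false;
    *out = payload;
    return true;
}
```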
  • 13. KAVERI SOC VERIFICATION APPROACH
  • 14. TRADITIONAL VERIFICATION AND SOC CHALLENGE. [Block diagram: CPU, NorthBridge, DRAM model, graphics model (GFX), SouthBridge BFM.] CPU-based verification: assembly-based input; a memory image of x86 machine code is preloaded into the DRAM model; the CPU fetches instructions from DRAM and executes them. GPU-based verification: higher-level language (C/C++); a BFM model is used across a PCIe-based interface to inject data; the GPU sends requests to DRAM over 2 paths, coherent and non-coherent. SoC verification challenge: the HSA coherency environment adds a layer of complexity; the SoC GPU needs to be programmed, which requires a host; the SoC CPU can be used as the host, but running the same host software stack results in huge simulation time; one approach is a mailbox, which is inefficient due to the lack of CPU-GPU interaction and has a longer run time; GPU-focused verification is not suitable for CPU-GPU interaction (HSA).
  • 15. SOC VERIFICATION METHODOLOGY: TEST FLOW. Running driver code on the simulated CPU is impossible due to simulation run-times. Intent Capture is a mechanism that allows existing discrete GPU graphics tests to execute on the CPU in a heterogeneous APU simulation. [Flow diagram labels: GPU test (OpenCL), sp3 shader, Intent Capture, capture output, Replay(); CPU test with one thread as driver CPU and other threads; CX shell; .sim memory image; APU RTL sim; test output.] The memory accesses and configuration writes from the test are extracted into C function calls. Intent Capture performs this activity and encapsulates the GPU test into a function called Replay. On the CPU side, one thread runs the Replay function while the other threads execute the CPU side of the test. The composite test (CPU test + generated FusionReplay function) is compiled using cxshell to generate a .sim memory image. Ref [4]
  • 16. POWER MANAGEMENT: BAPM. [Diagram labels: multiple boost P-states (Pb0 ... Pbx) above P0/Pbase and P1 in the HW view, SWP0, SWP1, ... in the SW/OS view; core power versus rest-of-APU power at Pbase for three example apps (App1, App2, App3) with low, medium and high CAC and with all or half of the cores active; die temperature and APU power.] Illustration with a CPU-centric scenario (Ref [2]; blocks: CPU Core1, CPU Core2, GPU Core1, GPU Core2): the compute unit power monitor calculates CPU power and the GPU power monitor calculates GPU power; firmware converts power into temperature estimates, compares the temperature to the limit and adjusts voltage/frequency; if Temp > Limit, reduce the power allocation; if Temp < Limit, increase the power allocation. In a multi-core design, apps running on CPU/GPU cores may consume less power. Power-efficient algorithms exploit this power headroom for performance. The GPU can borrow power credit from the CPU in GPU-centric scenarios and vice versa.
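The control loop on slide 16 (convert power to a temperature estimate, compare with the limit, adjust the allocation) can be sketched as follows. This is not AMD's firmware algorithm. The linear thermal model, the fixed step size, the clamping bounds and the function names are all assumptions made for illustration.

```c
/* Assumed linear thermal model: temperature rises theta degrees C per watt
 * above ambient. Real firmware uses a calibrated model not given here. */
double estimate_temp_c(double power_w, double ambient_c, double theta_c_per_w)
{
    return ambient_c + power_w * theta_c_per_w;
}

/* One control step: shrink the power allocation when the estimated
 * temperature is over the limit, grow it when under, and clamp the
 * result to [min_w, max_w]. Returns the new allocation in watts. */
double bapm_step(double alloc_w, double temp_c, double limit_c,
                 double step_w, double min_w, double max_w)
{
    if (temp_c > limit_c)
        alloc_w -= step_w;   /* If Temp > Limit, reduce power allocation */
    else if (temp_c < limit_c)
        alloc_w += step_w;   /* If Temp < Limit, increase power allocation */

    if (alloc_w < min_w) alloc_w = min_w;
    if (alloc_w > max_w) alloc_w = max_w;
    return alloc_w;
}
```

Calling `bapm_step` once per monitoring interval with the latest estimate gives the boost/throttle behaviour the slide describes; borrowing credit between CPU and GPU would need one allocation per domain and is left out of this sketch.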
  • 17. BAPM VERIFICATION APPROACH @ SOC. [Block diagram labels: CPU Core1 and CPU Core2, each with a CPU power monitor; GPU Core1 and GPU Core2, each with a GPU power monitor; NB CAC Manager; SMU F/W; multiple boost P-states; CPU-centric.] Developed high and low power consuming CPU patterns based on micro-architecture and power analysis. Interleaved high and low power patterns in random stimulus. Used an irritator to manipulate the credits sent to the CAC manager at times, to hit corner cases like back-to-back boost/throttle. Modeled the F/W algorithm using a simple BFM. Added a CSR framework to drive reads/writes to the CAC manager. A very few sanity tests ran with real F/W loaded through the backdoor to check the end-to-end flow. Used irritators to model GPU power credit reporting instead of running GPU applications. The GPU power monitor was verified at GPU IP level. Efficient coverage-driven random verification: CPU boosted because of the GPU giving away credits and vice versa; crosses of CPU/GPU events and their effect on BAPM.
  • 18. SOC VERIFICATION CHALLENGES & SOLUTION
  • 19. TEST STIMULUS REUSE AND PORTING TO SOC. Tool and flow differences/set-up across IP and SoC make stimulus reuse difficult. Approach: use a functional model to simulate the IP [RTL] in an SoC scenario for IP test development and easy porting to SoC. A simple HSA SoC test with 1 Rd-Wr takes about 18 hours in RTL, whereas it takes <1 hour on the heterogeneous C model. Intent Capture and Playback methodology. Test setup is updated at IP level to support test runs with SoC as a new target; test setup updates such as configuration changes and test stimulus defines allowed IP tests to be reused. [Model diagram labels: cMemory memory model, MPMM, MEMIO memory model, UNB, bus unit, GPU C model, CPU C model/RTL, CPU to GPU access, DV test, capture output, replay, test output, APU.] [IP2SoC script flow labels: export suite, test key; perf options (Perf_options.yml); memory config; NB/DCT prog. options; common test options; run_job command-line options; directories GNB, XNB, UNB; create job spec [ip2soc –merge]; run/execute regression; reports, sim output.] Goal: improve quality, reduce development time.
  • 20. HW-SW INTERACTION: MODELING AND ABSTRACTION. Complex and evolving logic is moving from hardware to firmware for better controllability. Challenges: firmware algorithms are compute-intensive and often developed late in the design cycle; the load and execute time of the software is an additional challenge for verification. Connected Standby verification approach: model the relevant section of the software using a BFM with a proper interface to the hardware; add sufficient controllability to stress different paths of the BFM model and find coverage; adapt the stimulus based on coverage of the BFM/state machine. Goals: improve quality, reduce development time.
  • 21. ADAPTIVE STIMULUS. Typically, power management transitions kick off after active code execution stops. This results in deeper corner cases associated with thread-level coordination in a multi-core design. Predicting occurrences of deeper phases and targeting them by code/stimulus is difficult. Approach: define the power management modes as state machines, each state having granular phases including thread-specific information; a dynamic irritator monitors these state transitions, inserts random/directed asynchronous events (such as different sorts of interrupts, probes, warm reset) and updates a scoreboard; events are generated very close to the relevant points, which provides great controllability; the dynamic irritator adapts based on scoreboard statistics, eventually giving more weight to the less frequently covered <state> X <event> buckets. Goals: improve quality, reduce development time.
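One way to realise the "more weight to less frequently covered buckets" idea from slide 21 is a weighted pick over a hit-count scoreboard. The sketch below is an assumption-laden illustration, not the team's irritator. The weighting formula (largest hit count in the state minus the bucket's own hit count, plus one), the table sizes and all names are hypothetical.

```c
#include <stdlib.h>

#define NUM_STATES 3  /* illustrative number of power-management states */
#define NUM_EVENTS 3  /* illustrative number of asynchronous event kinds */

/* Scoreboard: how often each <state> x <event> bucket has been hit. */
static unsigned hits[NUM_STATES][NUM_EVENTS];

static unsigned max_hits(int state)
{
    unsigned m = 0;
    for (int e = 0; e < NUM_EVENTS; e++)
        if (hits[state][e] > m) m = hits[state][e];
    return m;
}

/* Rarely hit buckets get larger weights; every bucket keeps weight >= 1. */
unsigned bucket_weight(int state, int event)
{
    return max_hits(state) - hits[state][event] + 1u;
}

unsigned total_weight(int state)
{
    unsigned t = 0;
    for (int e = 0; e < NUM_EVENTS; e++) t += bucket_weight(state, e);
    return t;
}

/* Map a draw r in [0, total_weight(state)) onto an event index. */
int pick_event(int state, unsigned r)
{
    for (int e = 0; e < NUM_EVENTS; e++) {
        unsigned w = bucket_weight(state, e);
        if (r < w) return e;
        r -= w;
    }
    return NUM_EVENTS - 1; /* only reached if r was out of range */
}

/* Choose an event for the current state and record it in the scoreboard. */
int inject_random_event(int state)
{
    int e = pick_event(state, (unsigned)rand() % total_weight(state));
    hits[state][e]++;
    return e;
}
```

Separating `pick_event` from the random draw keeps the weighting logic deterministic and easy to check; a testbench would call `inject_random_event` whenever the monitored state machine enters a new state.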
  • 22. CONSTRAINT RANDOM STIMULUS AND RANDOMIZATION AT SOC. [Diagram: random initial states (S11, S21, S23, ...); Ref [3].] A complex SoC requires randomization at different levels: SoC constraints; IP constraints; register; fuse; modes: LFBR, BfD, long_init/unfused. [Flow diagram labels: build randomization utility; package level info; CMD line options; RandomConfig executable; config values at time t=0; import value after reset; test run.] Goals: improve quality, reduce development time.
  • 23. OVERCOMING LIMITATIONS OF GATE-LEVEL SIMULATION. Challenges with netlist simulation: longer run-times; longer debug times. Approach to minimize runtime: compute-intensive RTL and the associated verification components must be replaced with a less intensive test-vector applicator that applies test vectors directly from the FSDB file. Flow: run RTL simulation and get the FSDB; create gatesim files (gatesim.v, forces.v); build with netlist + gatesim files + TB to drive stimulus from the FSDB; run netlist sims (with FSDB dump). Result: 10x runtime optimization over the traditional approach. Approach to minimize debug effort: Verdi NPI based methodology to automate debug. Ref [5]. Goals: improve quality, reduce development time.
  • 24. THANK YOU
  • 25. REFERENCES. [1] Roy Ju (AMD Senior Fellow), "A New Parallel Computing Platform – HSA", keynote speech, CTHPC 2013. [2] Praveen Dongara (System Architect), "AMD APUs: Dynamic Power Management Techniques", DAC 2013. [3] Wilson Research Group-MGC 2013. [4] Kaveri DTP, internal document. [5] Prodip K, Pankaj S, Meera M, Narendran K, "Innovative Approach to Overcome Limitations of Netlist Simulation", SUNG 2013.
  • 26. GLOSSARY. GPU: Graphics Processing Unit. APU: Accelerated Processing Unit. OpenCL™: Open Computing Language. TDP: Thermal Design Power, a measure of a design infrastructure’s ability to cool a device; represents the average thermal dissipation power required to cool the design. AMD Turbo Core Technology: AMD boost mechanism. BIAPM: Bi-directional Application Power Management. Cac: capacitance AC switching, measures the switching activity of a cluster. Pstate: processor performance state. GARLIC: Graphic Accelerated Reduced Latency Integrated Channel. ONION: On-chip Northbridge to I/O Noncoherent bus. FSDB: Fast Signal Database.
  • 27. BACKUP
  • 28. DYNAMIC FINE-GRAINED POWER TRANSFERS. The dynamically calculated temperature of each core and the GPU enables the operating point of each to be dynamically balanced in order to maximize performance within temperature limits. Low activity in one core enables it to be a thermal sink for a more active core. [Three temperature panels, each on a 75.0 to 100.0 scale: GPU-centric, Balanced, CPU-centric.] Ref [2]
  • 29. Disclaimer. The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD makes no representations or warranties with respect to the contents hereof and assumes no responsibility for any inaccuracies, errors or omissions that appear in this information. AMD specifically disclaims any implied warranties of merchantability or fitness for any particular purpose. In no event will AMD be liable to any person for any direct, indirect, special or other consequential damages arising from the use of any information contained herein, even if AMD is expressly advised of the possibility of such damages. Trademark Attribution: AMD, the AMD Arrow logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL and the OpenCL logo are trademarks of Apple, Inc. and used by permission by Khronos. Microsoft, Windows and DirectX are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2011 Advanced Micro Devices, Inc. All rights reserved.